Обсуждение: TM format can mix encodings in to_char()

Поиск

Список

Период

Сортировка

TM format can mix encodings in to_char()

От

Juan José Santamaría Flecha

Дата:

12 апреля 2019 г., 19:45:51

Hackers,

I will use as an example the code in the regression test 'collate.linux.utf8'.

There you can find:

SET lc_time TO 'tr_TR';

SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');

to_char

-------------

01 NIS 2010

(1 row)

The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5),

while the test runs in UTF8. So the following code will raise an error:

SET lc_time TO 'tr_TR';

SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');

ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75

The problem seems to be in the code touched in the attached patch.

Regards,

Juan Jose Santamaria Flecha

Вложения

tm-format-mixes-encodings.patch

Re: TM format can mix encodings in to_char()

От

Kyotaro HORIGUCHI

Дата:

19 апреля 2019 г., 11:30:17

Hello.

At Fri, 12 Apr 2019 18:45:51 +0200, Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote in
<CAC+AXB22So5aZm2vZe+MChYXec7gWfr-n-SK-iO091R0P_1Tew@mail.gmail.com>
> Hackers,
>
> I will use as an example the code in the regression test
> 'collate.linux.utf8'.
> There you can find:
>
> SET lc_time TO 'tr_TR';
> SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
>    to_char
> -------------
>  01 NIS 2010
> (1 row)
>
> The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9
> (LATIN5),
> while the test runs in UTF8. So the following code will raise an error:
>
> SET lc_time TO 'tr_TR';
> SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
> ERROR:  invalid byte sequence for encoding "UTF8": 0xde 0x75

The same case is handled for lc_numeric. lc_time ought to be
treated the same way.

> The problem seems to be in the code touched in the attached patch.

It seems basically correct, but cache_single_time does extra
strdup when pg_any_to_server did conversion. Maybe it would be
better be like this:

> oldcxt = MemoryContextSwitchTo(TopMemoryContext);
> ptr = pg_any_to_server(buf, strlen(buf), encoding);
>
> if (ptr == buf)
> {
>     /* Conversion didn't pstrdup, so we must */
>     ptr = pstrdup(buf);
> }
> MemoryContextSwitchTo(oldcxt);

-    int            i;
+    int            i,
+                encoding;

It is not strictly kept, but (I believe) we don't define multiple
variables in a single definition.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

19 апреля 2019 г., 19:47:39

=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:
> The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5),
> while the test runs in UTF8. So the following code will raise an error:

> SET lc_time TO 'tr_TR';
> SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
> ERROR:  invalid byte sequence for encoding "UTF8": 0xde 0x75

Ugh.

> The problem seems to be in the code touched in the attached patch.

Hmm.  I'd always imagined that the way that libc works is that LC_CTYPE
determines the encoding (codeset) it's using across the board, so that
functions like strftime would deliver data in that encoding.  That's
mainly based on the observation that nl_langinfo(CODESET) is specified
to depend on LC_CTYPE, and it would be monumentally stupid for any libc
functions to be operating according to a codeset that there's no way to
discover.

However, your example shows that at least glibc is indeed
monumentally stupid about this :-(.

But ... perhaps other implementations are not so silly?  I went
looking into the POSIX spec to see if it says anything about this,
and discovered (in Base Definitions section 7, Locale):

    If different character sets are used by the locale categories, the
    results achieved by an application utilizing these categories are
    undefined. Likewise, if different codesets are used for the data being
    processed by interfaces whose behavior is dependent on the current
    locale, or the codeset is different from the codeset assumed when the
    locale was created, the result is also undefined.

"Undefined" is a term of art here: it means the library can misbehave
arbitrarily badly, up to and including abort() or halt-and-catch-fire.
We do *not* want to be invoking undefined behavior, even if particular
implementations seem to behave sanely.  Your proposed patch isn't
getting us out of that, and what it is doing instead is embedding an
assumption that the implementation handles this in a particular way.

So what I'm thinking really needs to be done here is to force it to work
according to the LC_CTYPE-determines-the-codeset-for-everything model.
Note that that model is embedded into PG in quite a few ways besides the
one at stake here; for instance, pg_perm_setlocale thinks it should make
gettext track the LC_CTYPE encoding, not anything else.

If we're willing to assume a lot about how locale names are spelled,
we could imagine fixing this in cache_locale_time by having it strip
any encoding spec from the given LC_TIME string and then adding on the
codeset name from nl_langinfo(CODESET).  Not sure about how well
that'd play on Windows, though.  We'd also need to adjust check_locale
so that it does the same dance.

BTW, it seems very likely that we have similar issues with LC_MONETARY
and LC_NUMERIC in PGLC_localeconv().  There's an interesting Windows-only
hack in there now that seems to be addressing more or less the same issue;
I wonder whether that would be rendered unnecessary if we approached it
like this?

I'm also wondering why we have not noticed any comparable problem with
LC_MESSAGES or LC_COLLATE.  It's not so surprising that we haven't
understood this hazard before with LC_TIME/LC_MONETARY/LC_NUMERIC given
their limited usage in PG, but the same can't be said of LC_MESSAGES or
LC_COLLATE.

            regards, tom lane

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

20 апреля 2019 г., 18:50:01

I wrote:
> Hmm.  I'd always imagined that the way that libc works is that LC_CTYPE
> determines the encoding (codeset) it's using across the board, so that
> functions like strftime would deliver data in that encoding.
> [ and much more based on that ]

After further study of the code, the situation seems less dire than
I feared yesterday.  In the first place, we disallow settings of
LC_COLLATE and LC_CTYPE that don't match the database encoding, see
tests in dbcommands.c's check_encoding_locale_matches() and in initdb.
So that core functionality will be consistent in any case.

Also, I see that PGLC_localeconv() is effectively doing exactly what
you suggested for strings that are encoded according to LC_MONETARY
and LC_NUMERIC:

        encoding = pg_get_encoding_from_locale(locale_monetary, true);

        db_encoding_convert(encoding, &worklconv.int_curr_symbol);
        db_encoding_convert(encoding, &worklconv.currency_symbol);
        ...

This is a little bit off, now that I look at it, because it's
failing to account for the possibility of getting -1 from
pg_get_encoding_from_locale.  It should probably do what
pg_bind_textdomain_codeset does:

    if (encoding < 0)
        encoding = PG_SQL_ASCII;

since passing PG_SQL_ASCII to the conversion will have the effect of
validating the data without any actual conversion.

I remain wary of this idea because it's depending on something that's
undefined per POSIX, but apparently it's working well enough for
LC_MONETARY and LC_NUMERIC, so we can probably get away with it for
LC_TIME as well.  Anyway the current code clearly does not work on
glibc, and I also verified that there's a problem on FreeBSD, so
this patch should make things better.

Also, experimentation suggests that LC_MESSAGES actually does work
the way I thought this stuff works, ie, its implied codeset isn't
really used.  (I think this only matters for strerror(), since we
force the issue for gettext, but glibc's strerror() is clearly not
paying attention to that.)  Sigh, who needs consistency?

            regards, tom lane

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

20 апреля 2019 г., 21:46:12

I wrote:
> This is a little bit off, now that I look at it, because it's
> failing to account for the possibility of getting -1 from
> pg_get_encoding_from_locale.  It should probably do what
> pg_bind_textdomain_codeset does:
>     if (encoding < 0)
>         encoding = PG_SQL_ASCII;

Actually, the more I looked at the existing code, the less happy I got
with its error handling.  cache_locale_time() is an absolute failure
in that respect, because it makes no attempt at all to avoid throwing
errors while LC_TIME or LC_CTYPE is set to a transient value.  We
could maybe tolerate LC_TIME being weird, but continuing with a value
of LC_CTYPE that doesn't match the database setting would almost
certainly be disastrous.

PGLC_localeconv had at least thought about the issue, but it ends up
wimping out:

        /*
         * Report it if we failed to restore anything.  Perhaps this should be
         * FATAL, rather than continuing with bad locale settings?
         */
        if (trouble)
            elog(WARNING, "failed to restore old locale");

And it's also oddly willing to keep going even if it couldn't get the
original setlocale settings to begin with.

I think that this code was written with only LC_MONETARY and LC_NUMERIC
in mind, for which there's at least some small argument that continuing
with unwanted values wouldn't be fatal (though IIRC, LC_NUMERIC would
still change the behavior of float8out).  Since we added code to also
change LC_CTYPE on Windows, I think that continuing on after a restore
failure would be disastrous.  And we've not heard any field reports
of this warning anyway, so there's no reason to think we have to support
the case in practice.

Hence, the attached revised patch changes the code to do elog(FATAL)
if it can't restore any locale settings to their previous values,
and it fixes cache_locale_time to not do anything risky while it's
got transient locale settings in place.

I propose to apply and back-patch this; the code's basically the
same in all supported branches.

            regards, tom lane

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6e33d65..e607a39 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -490,7 +490,6 @@ PGLC_localeconv(void)
     static bool CurrentLocaleConvAllocated = false;
     struct lconv *extlconv;
     struct lconv worklconv;
-    bool        trouble = false;
     char       *save_lc_monetary;
     char       *save_lc_numeric;
 #ifdef WIN32
@@ -522,14 +521,16 @@ PGLC_localeconv(void)
      */
     memset(&worklconv, 0, sizeof(worklconv));

-    /* Save user's values of monetary and numeric locales */
+    /* Save prevailing values of monetary and numeric locales */
     save_lc_monetary = setlocale(LC_MONETARY, NULL);
-    if (save_lc_monetary)
-        save_lc_monetary = pstrdup(save_lc_monetary);
+    if (!save_lc_monetary)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_monetary = pstrdup(save_lc_monetary);

     save_lc_numeric = setlocale(LC_NUMERIC, NULL);
-    if (save_lc_numeric)
-        save_lc_numeric = pstrdup(save_lc_numeric);
+    if (!save_lc_numeric)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_numeric = pstrdup(save_lc_numeric);

 #ifdef WIN32

@@ -554,10 +555,11 @@ PGLC_localeconv(void)
      * UTF16 and convert from that.
      */

-    /* save user's value of ctype locale */
+    /* Save prevailing value of ctype locale */
     save_lc_ctype = setlocale(LC_CTYPE, NULL);
-    if (save_lc_ctype)
-        save_lc_ctype = pstrdup(save_lc_ctype);
+    if (!save_lc_ctype)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_ctype = pstrdup(save_lc_ctype);

     /* Here begins the critical section where we must not throw error */

@@ -601,27 +603,22 @@ PGLC_localeconv(void)
     worklconv.p_sign_posn = extlconv->p_sign_posn;
     worklconv.n_sign_posn = extlconv->n_sign_posn;

-    /* Try to restore internal settings */
-    if (save_lc_monetary)
-    {
-        if (!setlocale(LC_MONETARY, save_lc_monetary))
-            trouble = true;
-    }
-
-    if (save_lc_numeric)
-    {
-        if (!setlocale(LC_NUMERIC, save_lc_numeric))
-            trouble = true;
-    }
-
+    /*
+     * Restore the prevailing locale settings; failure to do so is fatal.
+     * Possibly we could limp along with nondefault LC_MONETARY or LC_NUMERIC,
+     * but proceeding with the wrong value of LC_CTYPE would certainly be bad
+     * news; and considering that the prevailing LC_MONETARY and LC_NUMERIC
+     * are almost certainly "C", there's really no reason that restoring those
+     * should fail.
+     */
 #ifdef WIN32
-    /* Try to restore internal ctype settings */
-    if (save_lc_ctype)
-    {
-        if (!setlocale(LC_CTYPE, save_lc_ctype))
-            trouble = true;
-    }
+    if (!setlocale(LC_CTYPE, save_lc_ctype))
+        elog(FATAL, "failed to restore LC_CTYPE to \"%s\"", save_lc_ctype);
 #endif
+    if (!setlocale(LC_MONETARY, save_lc_monetary))
+        elog(FATAL, "failed to restore LC_MONETARY to \"%s\"", save_lc_monetary);
+    if (!setlocale(LC_NUMERIC, save_lc_numeric))
+        elog(FATAL, "failed to restore LC_NUMERIC to \"%s\"", save_lc_numeric);

     /*
      * At this point we've done our best to clean up, and can call functions
@@ -632,21 +629,11 @@ PGLC_localeconv(void)
     {
         int            encoding;

-        /*
-         * Report it if we failed to restore anything.  Perhaps this should be
-         * FATAL, rather than continuing with bad locale settings?
-         */
-        if (trouble)
-            elog(WARNING, "failed to restore old locale");
-
         /* Release the pstrdup'd locale names */
-        if (save_lc_monetary)
-            pfree(save_lc_monetary);
-        if (save_lc_numeric)
-            pfree(save_lc_numeric);
+        pfree(save_lc_monetary);
+        pfree(save_lc_numeric);
 #ifdef WIN32
-        if (save_lc_ctype)
-            pfree(save_lc_ctype);
+        pfree(save_lc_ctype);
 #endif

         /* If any of the preceding strdup calls failed, complain now. */
@@ -657,15 +644,22 @@ PGLC_localeconv(void)

         /*
          * Now we must perform encoding conversion from whatever's associated
-         * with the locale into the database encoding.
+         * with the locales into the database encoding.  If we can't identify
+         * the encoding implied by LC_NUMERIC or LC_MONETARY (ie we get -1),
+         * use PG_SQL_ASCII, which will result in just validating that the
+         * strings are OK in the database encoding.
          */
         encoding = pg_get_encoding_from_locale(locale_numeric, true);
+        if (encoding < 0)
+            encoding = PG_SQL_ASCII;

         db_encoding_convert(encoding, &worklconv.decimal_point);
         db_encoding_convert(encoding, &worklconv.thousands_sep);
         /* grouping is not text and does not require conversion */

         encoding = pg_get_encoding_from_locale(locale_monetary, true);
+        if (encoding < 0)
+            encoding = PG_SQL_ASCII;

         db_encoding_convert(encoding, &worklconv.int_curr_symbol);
         db_encoding_convert(encoding, &worklconv.currency_symbol);
@@ -758,29 +752,29 @@ strftime_win32(char *dst, size_t dstlen,
 #define strftime(a,b,c,d) strftime_win32(a,b,c,d)
 #endif                            /* WIN32 */

-/* Subroutine for cache_locale_time(). */
+/*
+ * Subroutine for cache_locale_time().
+ * Convert the given string from encoding "encoding" to the database
+ * encoding, and store the result at *dst, replacing any previous value.
+ */
 static void
-cache_single_time(char **dst, const char *format, const struct tm *tm)
+cache_single_string(char **dst, const char *src, int encoding)
 {
-    char        buf[MAX_L10N_DATA];
     char       *ptr;
+    char       *olddst;

-    /*
-     * MAX_L10N_DATA is sufficient buffer space for every known locale, and
-     * POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
-     * error.)    An implementation might report errors (e.g. ENOMEM) by
-     * returning 0 (or, less plausibly, a negative value) and setting errno.
-     * Report errno just in case the implementation did that, but clear it in
-     * advance of the call so we don't emit a stale, unrelated errno.
-     */
-    errno = 0;
-    if (strftime(buf, MAX_L10N_DATA, format, tm) <= 0)
-        elog(ERROR, "strftime(%s) failed: %m", format);
+    /* Convert the string to the database encoding, or validate it's OK */
+    ptr = pg_any_to_server(src, strlen(src), encoding);

-    ptr = MemoryContextStrdup(TopMemoryContext, buf);
-    if (*dst)
-        pfree(*dst);
-    *dst = ptr;
+    /* Store the string in long-lived storage, replacing any previous value */
+    olddst = *dst;
+    *dst = MemoryContextStrdup(TopMemoryContext, ptr);
+    if (olddst)
+        pfree(olddst);
+
+    /* Might as well clean up any palloc'd conversion result, too */
+    if (ptr != src)
+        pfree(ptr);
 }

 /*
@@ -789,11 +783,14 @@ cache_single_time(char **dst, const char *format, const struct tm *tm)
 void
 cache_locale_time(void)
 {
-    char       *save_lc_time;
+    char        buf[(2 * 7 + 2 * 12) * MAX_L10N_DATA];
+    char       *bufptr;
     time_t        timenow;
     struct tm  *timeinfo;
+    bool        strftimefail = false;
+    int            encoding;
     int            i;
-
+    char       *save_lc_time;
 #ifdef WIN32
     char       *save_lc_ctype;
 #endif
@@ -804,10 +801,18 @@ cache_locale_time(void)

     elog(DEBUG3, "cache_locale_time() executed; locale: \"%s\"", locale_time);

-    /* save user's value of time locale */
+    /*
+     * As in PGLC_localeconv(), it's critical that we not throw error while
+     * libc's locale settings have nondefault values.  Hence, we just call
+     * strftime() within the critical section, and then convert and save its
+     * results afterwards.
+     */
+
+    /* Save prevailing value of time locale */
     save_lc_time = setlocale(LC_TIME, NULL);
-    if (save_lc_time)
-        save_lc_time = pstrdup(save_lc_time);
+    if (!save_lc_time)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_time = pstrdup(save_lc_time);

 #ifdef WIN32

@@ -820,10 +825,11 @@ cache_locale_time(void)
      * PGLC_localeconv().
      */

-    /* save user's value of ctype locale */
+    /* Save prevailing value of ctype locale */
     save_lc_ctype = setlocale(LC_CTYPE, NULL);
-    if (save_lc_ctype)
-        save_lc_ctype = pstrdup(save_lc_ctype);
+    if (!save_lc_ctype)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_ctype = pstrdup(save_lc_ctype);

     /* use lc_time to set the ctype */
     setlocale(LC_CTYPE, locale_time);
@@ -831,15 +837,33 @@ cache_locale_time(void)

     setlocale(LC_TIME, locale_time);

+    /* We use times close to current time as data for strftime(). */
     timenow = time(NULL);
     timeinfo = localtime(&timenow);

+    /* Store the strftime results in MAX_L10N_DATA-sized portions of buf[] */
+    bufptr = buf;
+
+    /*
+     * MAX_L10N_DATA is sufficient buffer space for every known locale, and
+     * POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
+     * error.)    An implementation might report errors (e.g. ENOMEM) by
+     * returning 0 (or, less plausibly, a negative value) and setting errno.
+     * Report errno just in case the implementation did that, but clear it in
+     * advance of the calls so we don't emit a stale, unrelated errno.
+     */
+    errno = 0;
+
     /* localized days */
     for (i = 0; i < 7; i++)
     {
         timeinfo->tm_wday = i;
-        cache_single_time(&localized_abbrev_days[i], "%a", timeinfo);
-        cache_single_time(&localized_full_days[i], "%A", timeinfo);
+        if (strftime(bufptr, MAX_L10N_DATA, "%a", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
+        if (strftime(bufptr, MAX_L10N_DATA, "%A", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
     }

     /* localized months */
@@ -847,27 +871,66 @@ cache_locale_time(void)
     {
         timeinfo->tm_mon = i;
         timeinfo->tm_mday = 1;    /* make sure we don't have invalid date */
-        cache_single_time(&localized_abbrev_months[i], "%b", timeinfo);
-        cache_single_time(&localized_full_months[i], "%B", timeinfo);
+        if (strftime(bufptr, MAX_L10N_DATA, "%b", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
+        if (strftime(bufptr, MAX_L10N_DATA, "%B", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
     }

-    /* try to restore internal settings */
-    if (save_lc_time)
+    /*
+     * Restore the prevailing locale settings; as in PGLC_localeconv(),
+     * failure to do so is fatal.
+     */
+#ifdef WIN32
+    if (!setlocale(LC_CTYPE, save_lc_ctype))
+        elog(FATAL, "failed to restore LC_CTYPE to \"%s\"", save_lc_ctype);
+#endif
+    if (!setlocale(LC_TIME, save_lc_time))
+        elog(FATAL, "failed to restore LC_TIME to \"%s\"", save_lc_time);
+
+    /*
+     * At this point we've done our best to clean up, and can throw errors, or
+     * call functions that might throw errors, with a clean conscience.
+     */
+    if (strftimefail)
+        elog(ERROR, "strftime() failed: %m");
+
+    /* Release the pstrdup'd locale names */
+    pfree(save_lc_time);
+#ifdef WIN32
+    pfree(save_lc_ctype);
+#endif
+
+    /*
+     * As in PGLC_localeconv(), we must convert the strings from the encoding
+     * implied by LC_TIME to the database encoding.  If we can't identify the
+     * LC_TIME encoding, just perform encoding validation.
+     */
+    encoding = pg_get_encoding_from_locale(locale_time, true);
+    if (encoding < 0)
+        encoding = PG_SQL_ASCII;
+
+    bufptr = buf;
+
+    /* localized days */
+    for (i = 0; i < 7; i++)
     {
-        if (!setlocale(LC_TIME, save_lc_time))
-            elog(WARNING, "failed to restore old locale");
-        pfree(save_lc_time);
+        cache_single_string(&localized_abbrev_days[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
+        cache_single_string(&localized_full_days[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
     }

-#ifdef WIN32
-    /* try to restore internal ctype settings */
-    if (save_lc_ctype)
+    /* localized months */
+    for (i = 0; i < 12; i++)
     {
-        if (!setlocale(LC_CTYPE, save_lc_ctype))
-            elog(WARNING, "failed to restore old locale");
-        pfree(save_lc_ctype);
+        cache_single_string(&localized_abbrev_months[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
+        cache_single_string(&localized_full_months[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
     }
-#endif

     CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 15b3222..619688f 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -430,6 +430,18 @@ SELECT relname FROM pg_class WHERE relname ~* '^abc';

 -- to_char
 SET lc_time TO 'tr_TR';
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
+   to_char
+-------------
+ 01 ŞUB 2010
+(1 row)
+
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
+   to_char
+-------------
+ 01 ŞUB 2010
+(1 row)
+
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
    to_char
 -------------
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 4ca02b8..c009fd2 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -169,6 +169,8 @@ SELECT relname FROM pg_class WHERE relname ~* '^abc';
 -- to_char

 SET lc_time TO 'tr_TR';
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

21 апреля 2019 г., 07:28:48

I wrote:
> [ fix-encoding-and-error-recovery-in-cache-locale-time.patch ]

On closer inspection, I'm pretty sure either version of this patch
will break things on Windows, because that platform already had code
to convert the result of wcsftime() to the database encoding; we
were adding code to do a second conversion, which will not go well.

The attached revised patch deletes the no-longer-necessary
platform-specific recoding stanza, in favor of having cache_locale_time
know that it's getting UTF8 rather than something else.  I also
updated a bunch of the related comments.

I don't have any way to test this on Windows, so could somebody
do that?  Manually running the Turkish test cases ought to be enough.

            regards, tom lane

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6e33d65..da6e2cb 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -22,8 +22,9 @@
  * settable at run-time.  However, we don't actually set those locale
  * categories permanently.  This would have bizarre effects like no
  * longer accepting standard floating-point literals in some locales.
- * Instead, we only set the locales briefly when needed, cache the
- * required information obtained from localeconv(), and set them back.
+ * Instead, we only set these locale categories briefly when needed,
+ * cache the required information obtained from localeconv() or
+ * strftime(), and then set the locale categories back to "C".
  * The cached information is only used by the formatting functions
  * (to_char, etc.) and the money type.  For the user, this should all be
  * transparent.
@@ -42,7 +43,7 @@
  * will change the memory save is pointing at.  To do this sort of thing
  * safely, you *must* pstrdup what setlocale returns the first time.
  *
- * FYI, The Open Group locale standard is defined here:
+ * The POSIX locale standard is available here:
  *
  *    http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
  *----------
@@ -490,7 +491,6 @@ PGLC_localeconv(void)
     static bool CurrentLocaleConvAllocated = false;
     struct lconv *extlconv;
     struct lconv worklconv;
-    bool        trouble = false;
     char       *save_lc_monetary;
     char       *save_lc_numeric;
 #ifdef WIN32
@@ -522,42 +522,38 @@ PGLC_localeconv(void)
      */
     memset(&worklconv, 0, sizeof(worklconv));

-    /* Save user's values of monetary and numeric locales */
+    /* Save prevailing values of monetary and numeric locales */
     save_lc_monetary = setlocale(LC_MONETARY, NULL);
-    if (save_lc_monetary)
-        save_lc_monetary = pstrdup(save_lc_monetary);
+    if (!save_lc_monetary)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_monetary = pstrdup(save_lc_monetary);

     save_lc_numeric = setlocale(LC_NUMERIC, NULL);
-    if (save_lc_numeric)
-        save_lc_numeric = pstrdup(save_lc_numeric);
+    if (!save_lc_numeric)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_numeric = pstrdup(save_lc_numeric);

 #ifdef WIN32

     /*
-     * Ideally, monetary and numeric local symbols could be returned in any
-     * server encoding.  Unfortunately, the WIN32 API does not allow
-     * setlocale() to return values in a codepage/CTYPE that uses more than
-     * two bytes per character, such as UTF-8:
+     * The POSIX standard explicitly says that it is undefined what happens if
+     * LC_MONETARY or LC_NUMERIC imply an encoding (codeset) different from
+     * that implied by LC_CTYPE.  In practice, all Unix-ish platforms seem to
+     * believe that localeconv() should return strings that are encoded in the
+     * codeset implied by the LC_MONETARY or LC_NUMERIC locale name.  Hence,
+     * once we have successfully collected the localeconv() results, we will
+     * convert them from that codeset to the desired server encoding.
      *
-     * http://msdn.microsoft.com/en-us/library/x99tb11d.aspx
-     *
-     * Evidently, LC_CTYPE allows us to control the encoding used for strings
-     * returned by localeconv().  The Open Group standard, mentioned at the
-     * top of this C file, doesn't explicitly state this.
-     *
-     * Therefore, we set LC_CTYPE to match LC_NUMERIC or LC_MONETARY (which
-     * cannot be UTF8), call localeconv(), and then convert from the
-     * numeric/monetary LC_CTYPE to the server encoding.  One example use of
-     * this is for the Euro symbol.
-     *
-     * Perhaps someday we will use GetLocaleInfoW() which returns values in
-     * UTF16 and convert from that.
+     * Windows, of course, resolutely does things its own way; on that
+     * platform LC_CTYPE has to match LC_MONETARY/LC_NUMERIC to get sane
+     * results.  Hence, we must temporarily set that category as well.
      */

-    /* save user's value of ctype locale */
+    /* Save prevailing value of ctype locale */
     save_lc_ctype = setlocale(LC_CTYPE, NULL);
-    if (save_lc_ctype)
-        save_lc_ctype = pstrdup(save_lc_ctype);
+    if (!save_lc_ctype)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_ctype = pstrdup(save_lc_ctype);

     /* Here begins the critical section where we must not throw error */

@@ -601,27 +597,22 @@ PGLC_localeconv(void)
     worklconv.p_sign_posn = extlconv->p_sign_posn;
     worklconv.n_sign_posn = extlconv->n_sign_posn;

-    /* Try to restore internal settings */
-    if (save_lc_monetary)
-    {
-        if (!setlocale(LC_MONETARY, save_lc_monetary))
-            trouble = true;
-    }
-
-    if (save_lc_numeric)
-    {
-        if (!setlocale(LC_NUMERIC, save_lc_numeric))
-            trouble = true;
-    }
-
+    /*
+     * Restore the prevailing locale settings; failure to do so is fatal.
+     * Possibly we could limp along with nondefault LC_MONETARY or LC_NUMERIC,
+     * but proceeding with the wrong value of LC_CTYPE would certainly be bad
+     * news; and considering that the prevailing LC_MONETARY and LC_NUMERIC
+     * are almost certainly "C", there's really no reason that restoring those
+     * should fail.
+     */
 #ifdef WIN32
-    /* Try to restore internal ctype settings */
-    if (save_lc_ctype)
-    {
-        if (!setlocale(LC_CTYPE, save_lc_ctype))
-            trouble = true;
-    }
+    if (!setlocale(LC_CTYPE, save_lc_ctype))
+        elog(FATAL, "failed to restore LC_CTYPE to \"%s\"", save_lc_ctype);
 #endif
+    if (!setlocale(LC_MONETARY, save_lc_monetary))
+        elog(FATAL, "failed to restore LC_MONETARY to \"%s\"", save_lc_monetary);
+    if (!setlocale(LC_NUMERIC, save_lc_numeric))
+        elog(FATAL, "failed to restore LC_NUMERIC to \"%s\"", save_lc_numeric);

     /*
      * At this point we've done our best to clean up, and can call functions
@@ -632,21 +623,11 @@ PGLC_localeconv(void)
     {
         int            encoding;

-        /*
-         * Report it if we failed to restore anything.  Perhaps this should be
-         * FATAL, rather than continuing with bad locale settings?
-         */
-        if (trouble)
-            elog(WARNING, "failed to restore old locale");
-
         /* Release the pstrdup'd locale names */
-        if (save_lc_monetary)
-            pfree(save_lc_monetary);
-        if (save_lc_numeric)
-            pfree(save_lc_numeric);
+        pfree(save_lc_monetary);
+        pfree(save_lc_numeric);
 #ifdef WIN32
-        if (save_lc_ctype)
-            pfree(save_lc_ctype);
+        pfree(save_lc_ctype);
 #endif

         /* If any of the preceding strdup calls failed, complain now. */
@@ -657,15 +638,22 @@ PGLC_localeconv(void)

         /*
          * Now we must perform encoding conversion from whatever's associated
-         * with the locale into the database encoding.
+         * with the locales into the database encoding.  If we can't identify
+         * the encoding implied by LC_NUMERIC or LC_MONETARY (ie we get -1),
+         * use PG_SQL_ASCII, which will result in just validating that the
+         * strings are OK in the database encoding.
          */
         encoding = pg_get_encoding_from_locale(locale_numeric, true);
+        if (encoding < 0)
+            encoding = PG_SQL_ASCII;

         db_encoding_convert(encoding, &worklconv.decimal_point);
         db_encoding_convert(encoding, &worklconv.thousands_sep);
         /* grouping is not text and does not require conversion */

         encoding = pg_get_encoding_from_locale(locale_monetary, true);
+        if (encoding < 0)
+            encoding = PG_SQL_ASCII;

         db_encoding_convert(encoding, &worklconv.int_curr_symbol);
         db_encoding_convert(encoding, &worklconv.currency_symbol);
@@ -693,15 +681,15 @@ PGLC_localeconv(void)

 #ifdef WIN32
 /*
- * On WIN32, strftime() returns the encoding in CP_ACP (the default
- * operating system codpage for that computer), which is likely different
+ * On Windows, strftime() returns its output in encoding CP_ACP (the default
+ * operating system codepage for the computer), which is likely different
  * from SERVER_ENCODING.  This is especially important in Japanese versions
  * of Windows which will use SJIS encoding, which we don't support as a
  * server encoding.
  *
  * So, instead of using strftime(), use wcsftime() to return the value in
- * wide characters (internally UTF16) and then convert it to the appropriate
- * database encoding.
+ * wide characters (internally UTF16) and then convert to UTF8, which we
+ * know how to handle directly.
  *
  * Note that this only affects the calls to strftime() in this file, which are
  * used to get the locale-aware strings. Other parts of the backend use
@@ -712,10 +700,13 @@ strftime_win32(char *dst, size_t dstlen,
                const char *format, const struct tm *tm)
 {
     size_t        len;
-    wchar_t        wformat[8];        /* formats used below need 3 bytes */
+    wchar_t        wformat[8];        /* formats used below need 3 chars */
     wchar_t        wbuf[MAX_L10N_DATA];

-    /* get a wchar_t version of the format string */
+    /*
+     * Get a wchar_t version of the format string.  We only actually use
+     * plain-ASCII formats in this file, so we can say that they're UTF8.
+     */
     len = MultiByteToWideChar(CP_UTF8, 0, format, -1,
                               wformat, lengthof(wformat));
     if (len == 0)
@@ -726,7 +717,7 @@ strftime_win32(char *dst, size_t dstlen,
     if (len == 0)
     {
         /*
-         * strftime failed, possibly because the result would not fit in
+         * wcsftime failed, possibly because the result would not fit in
          * MAX_L10N_DATA.  Return 0 with the contents of dst unspecified.
          */
         return 0;
@@ -739,17 +730,6 @@ strftime_win32(char *dst, size_t dstlen,
              GetLastError());

     dst[len] = '\0';
-    if (GetDatabaseEncoding() != PG_UTF8)
-    {
-        char       *convstr = pg_any_to_server(dst, len, PG_UTF8);
-
-        if (convstr != dst)
-        {
-            strlcpy(dst, convstr, dstlen);
-            len = strlen(dst);
-            pfree(convstr);
-        }
-    }

     return len;
 }
@@ -758,29 +738,29 @@ strftime_win32(char *dst, size_t dstlen,
 #define strftime(a,b,c,d) strftime_win32(a,b,c,d)
 #endif                            /* WIN32 */

-/* Subroutine for cache_locale_time(). */
+/*
+ * Subroutine for cache_locale_time().
+ * Convert the given string from encoding "encoding" to the database
+ * encoding, and store the result at *dst, replacing any previous value.
+ */
 static void
-cache_single_time(char **dst, const char *format, const struct tm *tm)
+cache_single_string(char **dst, const char *src, int encoding)
 {
-    char        buf[MAX_L10N_DATA];
     char       *ptr;
+    char       *olddst;

-    /*
-     * MAX_L10N_DATA is sufficient buffer space for every known locale, and
-     * POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
-     * error.)    An implementation might report errors (e.g. ENOMEM) by
-     * returning 0 (or, less plausibly, a negative value) and setting errno.
-     * Report errno just in case the implementation did that, but clear it in
-     * advance of the call so we don't emit a stale, unrelated errno.
-     */
-    errno = 0;
-    if (strftime(buf, MAX_L10N_DATA, format, tm) <= 0)
-        elog(ERROR, "strftime(%s) failed: %m", format);
+    /* Convert the string to the database encoding, or validate it's OK */
+    ptr = pg_any_to_server(src, strlen(src), encoding);

-    ptr = MemoryContextStrdup(TopMemoryContext, buf);
-    if (*dst)
-        pfree(*dst);
-    *dst = ptr;
+    /* Store the string in long-lived storage, replacing any previous value */
+    olddst = *dst;
+    *dst = MemoryContextStrdup(TopMemoryContext, ptr);
+    if (olddst)
+        pfree(olddst);
+
+    /* Might as well clean up any palloc'd conversion result, too */
+    if (ptr != src)
+        pfree(ptr);
 }

 /*
@@ -789,11 +769,14 @@ cache_single_time(char **dst, const char *format, const struct tm *tm)
 void
 cache_locale_time(void)
 {
-    char       *save_lc_time;
+    char        buf[(2 * 7 + 2 * 12) * MAX_L10N_DATA];
+    char       *bufptr;
     time_t        timenow;
     struct tm  *timeinfo;
+    bool        strftimefail = false;
+    int            encoding;
     int            i;
-
+    char       *save_lc_time;
 #ifdef WIN32
     char       *save_lc_ctype;
 #endif
@@ -804,26 +787,33 @@ cache_locale_time(void)

     elog(DEBUG3, "cache_locale_time() executed; locale: \"%s\"", locale_time);

-    /* save user's value of time locale */
+    /*
+     * As in PGLC_localeconv(), it's critical that we not throw error while
+     * libc's locale settings have nondefault values.  Hence, we just call
+     * strftime() within the critical section, and then convert and save its
+     * results afterwards.
+     */
+
+    /* Save prevailing value of time locale */
     save_lc_time = setlocale(LC_TIME, NULL);
-    if (save_lc_time)
-        save_lc_time = pstrdup(save_lc_time);
+    if (!save_lc_time)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_time = pstrdup(save_lc_time);

 #ifdef WIN32

     /*
-     * On WIN32, there is no way to get locale-specific time values in a
-     * specified locale, like we do for monetary/numeric.  We can only get
-     * CP_ACP (see strftime_win32) or UTF16.  Therefore, we get UTF16 and
-     * convert it to the database locale.  However, wcsftime() internally uses
-     * LC_CTYPE, so we set it here.  See the WIN32 comment near the top of
-     * PGLC_localeconv().
+     * On Windows, it appears that wcsftime() internally uses LC_CTYPE, so we
+     * must set it here.  This code looks the same as what PGLC_localeconv()
+     * does, but the underlying reason is different: this does NOT determine
+     * the encoding we'll get back from strftime_win32().
      */

-    /* save user's value of ctype locale */
+    /* Save prevailing value of ctype locale */
     save_lc_ctype = setlocale(LC_CTYPE, NULL);
-    if (save_lc_ctype)
-        save_lc_ctype = pstrdup(save_lc_ctype);
+    if (!save_lc_ctype)
+        elog(ERROR, "setlocale(NULL) failed");
+    save_lc_ctype = pstrdup(save_lc_ctype);

     /* use lc_time to set the ctype */
     setlocale(LC_CTYPE, locale_time);
@@ -831,15 +821,33 @@ cache_locale_time(void)

     setlocale(LC_TIME, locale_time);

+    /* We use times close to current time as data for strftime(). */
     timenow = time(NULL);
     timeinfo = localtime(&timenow);

+    /* Store the strftime results in MAX_L10N_DATA-sized portions of buf[] */
+    bufptr = buf;
+
+    /*
+     * MAX_L10N_DATA is sufficient buffer space for every known locale, and
+     * POSIX defines no strftime() errors.  (Buffer space exhaustion is not an
+     * error.)  An implementation might report errors (e.g. ENOMEM) by
+     * returning 0 (or, less plausibly, a negative value) and setting errno.
+     * Report errno just in case the implementation did that, but clear it in
+     * advance of the calls so we don't emit a stale, unrelated errno.
+     */
+    errno = 0;
+
     /* localized days */
     for (i = 0; i < 7; i++)
     {
         timeinfo->tm_wday = i;
-        cache_single_time(&localized_abbrev_days[i], "%a", timeinfo);
-        cache_single_time(&localized_full_days[i], "%A", timeinfo);
+        if (strftime(bufptr, MAX_L10N_DATA, "%a", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
+        if (strftime(bufptr, MAX_L10N_DATA, "%A", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
     }

     /* localized months */
@@ -847,27 +855,78 @@ cache_locale_time(void)
     {
         timeinfo->tm_mon = i;
         timeinfo->tm_mday = 1;    /* make sure we don't have invalid date */
-        cache_single_time(&localized_abbrev_months[i], "%b", timeinfo);
-        cache_single_time(&localized_full_months[i], "%B", timeinfo);
+        if (strftime(bufptr, MAX_L10N_DATA, "%b", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
+        if (strftime(bufptr, MAX_L10N_DATA, "%B", timeinfo) <= 0)
+            strftimefail = true;
+        bufptr += MAX_L10N_DATA;
     }

-    /* try to restore internal settings */
-    if (save_lc_time)
+    /*
+     * Restore the prevailing locale settings; as in PGLC_localeconv(),
+     * failure to do so is fatal.
+     */
+#ifdef WIN32
+    if (!setlocale(LC_CTYPE, save_lc_ctype))
+        elog(FATAL, "failed to restore LC_CTYPE to \"%s\"", save_lc_ctype);
+#endif
+    if (!setlocale(LC_TIME, save_lc_time))
+        elog(FATAL, "failed to restore LC_TIME to \"%s\"", save_lc_time);
+
+    /*
+     * At this point we've done our best to clean up, and can throw errors, or
+     * call functions that might throw errors, with a clean conscience.
+     */
+    if (strftimefail)
+        elog(ERROR, "strftime() failed: %m");
+
+    /* Release the pstrdup'd locale names */
+    pfree(save_lc_time);
+#ifdef WIN32
+    pfree(save_lc_ctype);
+#endif
+
+#ifndef WIN32
+
+    /*
+     * As in PGLC_localeconv(), we must convert strftime()'s output from the
+     * encoding implied by LC_TIME to the database encoding.  If we can't
+     * identify the LC_TIME encoding, just perform encoding validation.
+     */
+    encoding = pg_get_encoding_from_locale(locale_time, true);
+    if (encoding < 0)
+        encoding = PG_SQL_ASCII;
+
+#else
+
+    /*
+     * On Windows, strftime_win32() always returns UTF8 data, so convert from
+     * that if necessary.
+     */
+    encoding = PG_UTF8;
+
+#endif                            /* WIN32 */
+
+    bufptr = buf;
+
+    /* localized days */
+    for (i = 0; i < 7; i++)
     {
-        if (!setlocale(LC_TIME, save_lc_time))
-            elog(WARNING, "failed to restore old locale");
-        pfree(save_lc_time);
+        cache_single_string(&localized_abbrev_days[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
+        cache_single_string(&localized_full_days[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
     }

-#ifdef WIN32
-    /* try to restore internal ctype settings */
-    if (save_lc_ctype)
+    /* localized months */
+    for (i = 0; i < 12; i++)
     {
-        if (!setlocale(LC_CTYPE, save_lc_ctype))
-            elog(WARNING, "failed to restore old locale");
-        pfree(save_lc_ctype);
+        cache_single_string(&localized_abbrev_months[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
+        cache_single_string(&localized_full_months[i], bufptr, encoding);
+        bufptr += MAX_L10N_DATA;
     }
-#endif

     CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 15b3222..619688f 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -430,6 +430,18 @@ SELECT relname FROM pg_class WHERE relname ~* '^abc';

 -- to_char
 SET lc_time TO 'tr_TR';
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
+   to_char
+-------------
+ 01 ŞUB 2010
+(1 row)
+
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
+   to_char
+-------------
+ 01 ŞUB 2010
+(1 row)
+
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
    to_char
 -------------
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 4ca02b8..c009fd2 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -169,6 +169,8 @@ SELECT relname FROM pg_class WHERE relname ~* '^abc';
 -- to_char

 SET lc_time TO 'tr_TR';
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
+SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");

Re: TM format can mix encodings in to_char()

От

Andrew Dunstan

Дата:

21 апреля 2019 г., 16:25:48

On 4/21/19 12:28 AM, Tom Lane wrote:
> I wrote:
>> [ fix-encoding-and-error-recovery-in-cache-locale-time.patch ]
> On closer inspection, I'm pretty sure either version of this patch
> will break things on Windows, because that platform already had code
> to convert the result of wcsftime() to the database encoding; we
> were adding code to do a second conversion, which will not go well.
>
> The attached revised patch deletes the no-longer-necessary
> platform-specific recoding stanza, in favor of having cache_locale_time
> know that it's getting UTF8 rather than something else.  I also
> updated a bunch of the related comments.
>
> I don't have any way to test this on Windows, so could somebody
> do that?  Manually running the Turkish test cases ought to be enough.
>
>             


How does one do that? Just set a Turkish locale?


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

21 апреля 2019 г., 17:21:19

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 4/21/19 12:28 AM, Tom Lane wrote:
>> I don't have any way to test this on Windows, so could somebody
>> do that?  Manually running the Turkish test cases ought to be enough.

> How does one do that? Just set a Turkish locale?

Try variants of the original test case.  For instance, in a UTF8
database,

regression=# show server_encoding ;
 server_encoding 
-----------------
 UTF8
(1 row)

regression=# SET lc_time TO 'tr_TR.iso88599';
SET
regression=# SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
   to_char    
--------------
 01 ŞUB 2010
(1 row)

Unpatched, I get an error about invalid data.  Now, this is in
a Linux machine, and you'll have to adapt it for Windows --- at
least change the LC_TIME setting.  But the idea is to get out some
non-ASCII strings from an LC_TIME setting that names an encoding
different from the database's.

(I suspect you'll find that the existing code works fine on
Windows, it's only the first version(s) of this patch that fail.)

            regards, tom lane

Re: TM format can mix encodings in to_char()

От

Peter Geoghegan

Дата:

22 апреля 2019 г., 21:51:48

On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> How does one do that? Just set a Turkish locale?

tr_TR is, in a sense, special among locales:

http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html

The Turkish dotless i has apparently been implicated in all kinds of
bugs in quite a variety of contexts.

-- 
Peter Geoghegan

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

22 апреля 2019 г., 22:16:03

Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> How does one do that? Just set a Turkish locale?

> tr_TR is, in a sense, special among locales:
> http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
> The Turkish dotless i has apparently been implicated in all kinds of
> bugs in quite a variety of contexts.

Yeah, we've had our share of those :-(.  But the dotless i is not the
problem here --- it happens to not trigger an encoding conversion
issue, it seems.  Amusingly, the existing test case for lc_time = tr_TR
in collate.linux.utf8.sql is specifically coded to check what happens
with dotted/dotless i, and yet it manages to not trip over this problem.
(I suspect the reason is that what comes out of strftime is "Nis" which
is ASCII, and the non-ASCII characters only arise from subsequent case
conversion within PG.)

            regards, tom lane

Re: TM format can mix encodings in to_char()

От

Juan José Santamaría Flecha

Дата:

23 апреля 2019 г., 00:41:57

Actually, I tried to show my findings with the tr_TR regression test, but you

can reproduce the same issue with other locales and non-ASCII characters, as

Tom has pointed out.

For exampe:

de_DE ISO-8859-1: March

es_ES ISO-8859-1: Wednesday

fr_FR ISO-8859-1: February

Regards,

Juan José Santamaría Flecha

Re: TM format can mix encodings in to_char()

От

Andrew Dunstan

Дата:

24 апреля 2019 г., 01:07:25

On 4/21/19 10:21 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 4/21/19 12:28 AM, Tom Lane wrote:
>>> I don't have any way to test this on Windows, so could somebody
>>> do that?  Manually running the Turkish test cases ought to be enough.
>> How does one do that? Just set a Turkish locale?
> Try variants of the original test case.  For instance, in a UTF8
> database,
>
> regression=# show server_encoding ;
>  server_encoding 
> -----------------
>  UTF8
> (1 row)
>
> regression=# SET lc_time TO 'tr_TR.iso88599';
> SET
> regression=# SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
>    to_char    
> --------------
>  01 ŞUB 2010
> (1 row)
>
> Unpatched, I get an error about invalid data.  Now, this is in
> a Linux machine, and you'll have to adapt it for Windows --- at
> least change the LC_TIME setting.  But the idea is to get out some
> non-ASCII strings from an LC_TIME setting that names an encoding
> different from the database's.
>
> (I suspect you'll find that the existing code works fine on
> Windows, it's only the first version(s) of this patch that fail.)
>
>             



Test above works as expected with the patch, see attached.  This is from
jacana.


LMK if you want more tests run before I blow the test instance away


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

testme.out

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

24 апреля 2019 г., 01:10:25

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> Test above works as expected with the patch, see attached.  This is from
> jacana.

Great, thanks for checking!

> LMK if you want more tests run before I blow the test instance away

Can't think of anything else.

It'd be nice if we could cover stuff like this in the regression tests,
but I'm not sure how, seeing that the locale names are platform-dependent
and the overall behavior will also depend on the database encoding ...

            regards, tom lane

Re: TM format can mix encodings in to_char()

От

Juanjo Santamaria Flecha

Дата:

29 июня 2019 г., 09:21:37

It looks as if no work is left for this patch, so maybe updating the author to Tom Lane (I'm just a repoter at this
point,which it's fine) and the status to ready for committer would better reflect its current status. Does anyone think
otherwise?

Regards,

Juan José Santamaría Flecha

Re: TM format can mix encodings in to_char()

От

Tom Lane

Дата:

29 июня 2019 г., 16:58:42

Juanjo Santamaria Flecha <juanjo.santamaria@gmail.com> writes:
> It looks as if no work is left for this patch, so maybe updating the author to Tom Lane (I'm just a repoter at this
point,which it's fine) and the status to ready for committer would better reflect its current status. Does anyone think
otherwise?

Yeah, this was dealt with in 7ad1cd31b et al.  I didn't realize there
was a CF entry for it, or I would have closed it then.  I've done so now.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: TM format can mix encodings in to_char()

Вложения

Вложения