Обсуждение: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

Поиск
Список
Период
Сортировка

BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

От
digoal@126.com
Дата:
The following bug has been logged on the website:

Bug reference:      9210
Logged by:          digoal.zhou
Email address:      digoal@126.com
PostgreSQL version: 9.3.2
Operating system:   CentOS 6.4
Description:

In PostgreSQL , in addition sql_ascii coded character set encoding does not
check the legitimacy of the other characters are encoded legality checks .
This has clear instructions in the PostgreSQL manual:
Reference
http://www.postgresql.org/docs/9.3/static/multibyte.html
http://blog.163.com/digoal @ 126/blog/static/163877040201211281407682 /
http://blog.163.com/digoal @ 126/blog/static/16387704020132150381348 /
http://blog.163.com/digoal @ 126/blog/static/1638770402011718112210516 /
For example, we are using a database server UTF8 character set , then the
time stored UTF8 database checks whether the agreement , if it does not it
will error .
For example:
ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1 0x22
It would appear that illegal characters should not be stored in the
database, but recently doing some data migration discovered this problem ,
the two databases on two servers, operating systems consistent LANG, are
UTF8, database the characters are also UTF8, but after export data to
another database into some of the data that is reported such an error , and
therefore can not normally migrate some data .
>From the face of it , it should be stored in the database of the illegal
character . Then since there are checks , but also how to store into it?
I used here convert_from this function to restore this phenomenon.
We see , in addition to sql_ascii will convert character sets.
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'gbk') as g
(t);
 t | t
---- + ----------
 He | \ xe79b8d
(1 row)
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'utf8') as g
(t);
ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1
But truthfully use sql_ascii is stored byte stream , without any
conversion.
postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'sql_ascii')
as g (t);
 t | t
--- + --------
   | \ Xeec1
(1 row)
Using this method, the illegal byte stream into the database .
postgres = # create table test (info text);
CREATE TABLE
postgres = # insert into test values ​​(convert_from ('\ xeec1',
'sql_ascii'));
INSERT 0 1
postgres = # select info, info :: bytea from test;
 info | info
------ + --------
      | \ Xeec1
(1 row)
This record came out , then the backup is not restored , as follows:
postgres @ db-192-168-173-55-> pg_dump-t test-a | psql-f -
SET
SET
SET
SET
SET
SET
psql: <stdin>: 19: ERROR: invalid byte sequence for encoding "UTF8": 0xee
0xc1
CONTEXT: COPY test, line 1

Can be used directly in the database, for example :
postgres = # insert into test select * from test;
INSERT 0 1
postgres = # select info, info :: bytea from test;
 info | info
------ + --------
      | \ Xeec1
      | \ Xeec1
(2 rows)

This problem makes the backup data reduction becomes a hassle .
There may be other ways to put the data into the database illegally , which
resulted in the present situation .
If you want to import the data across databases currently available through
DBLINK or external table to import the illegal character of the problem the
way to use the backup to restore occurs .

[ References ]
1. Src / backend / utils / mb / mbutils.c
2. Src / backend / utils / mb / wchar.c
3. Postgres = # \ df convert *
                              List of functions
   Schema | Name | Result data type | Argument data types | Type
------------ + -------------- + ------------------ + --- ------------------
+ --------
 pg_catalog | convert | bytea | bytea, name, name | normal
 pg_catalog | convert_from | text | bytea, name | normal
 pg_catalog | convert_to | bytea | text, name | normal
(3 rows)

Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

От
Tom Lane
Дата:
digoal@126.com writes:
> select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t);
> [ fails to check that string is valid in database encoding ]

Hm, yeah.  Normal input to the database goes through pg_any_to_server(),
which will apply a validation step if the source encoding is SQL_ASCII
and the destination encoding is something else.  However, pg_convert and
some other places call pg_do_encoding_conversion() directly, and that
function will just quietly do nothing if either encoding is SQL_ASCII.

The minimum-refactoring solution to this would be to tweak
pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

I'm not sure if this would break anything we need to have work,
though.  Thoughts?  Do we want to back-patch such a change?
        regards, tom lane



HI,
   Thanks very much.  
    We use dblink or foreign table migrate datas instead pg_dump now resolve the error data load problem.
--
公益是一辈子的事,I'm Digoal,Just Do It.


At 2014-02-14 04:49:08,"Tom Lane" <tgl@sss.pgh.pa.us> wrote: >digoal@126.com writes: >> select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t); >> [ fails to check that string is valid in database encoding ] > >Hm, yeah.  Normal input to the database goes through pg_any_to_server(), >which will apply a validation step if the source encoding is SQL_ASCII >and the destination encoding is something else.  However, pg_convert and >some other places call pg_do_encoding_conversion() directly, and that >function will just quietly do nothing if either encoding is SQL_ASCII. > >The minimum-refactoring solution to this would be to tweak >pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but >the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing. > >I'm not sure if this would break anything we need to have work, >though.  Thoughts?  Do we want to back-patch such a change? > > regards, tom lane


Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

От
Tom Lane
Дата:
I wrote:
> digoal@126.com writes:
>> select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t);
>> [ fails to check that string is valid in database encoding ]

> Hm, yeah.  Normal input to the database goes through pg_any_to_server(),
> which will apply a validation step if the source encoding is SQL_ASCII
> and the destination encoding is something else.  However, pg_convert and
> some other places call pg_do_encoding_conversion() directly, and that
> function will just quietly do nothing if either encoding is SQL_ASCII.

> The minimum-refactoring solution to this would be to tweak
> pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
> the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

> I'm not sure if this would break anything we need to have work,
> though.  Thoughts?  Do we want to back-patch such a change?

I looked through all the callers of pg_do_encoding_conversion(), and
AFAICS this change is a good idea.  There are a whole bunch of places
that use pg_do_encoding_conversion() to convert from the database encoding
to encoding X (most usually UTF8), and right now if you do that in a
SQL_ASCII database you have no assurance whatever that what is produced
is actually valid in encoding X.  I think we need to close that loophole.

I found one place --- utf_u2e() in plperl_helpers.h --- that is aware of
the lack of checking and forces a pg_verify_mbstr call for itself; but
it apparently is concerned about whether the source data is actually utf8
in the first place, which I think is not really
pg_do_encoding_conversion's bailiwick.  I'm okay with
pg_do_encoding_conversion being a no-op if src_encoding == dest_encoding.

Barring objections, I will fix and back-patch this.
        regards, tom lane



I wrote:
> I looked through all the callers of pg_do_encoding_conversion(), and
> AFAICS this change is a good idea.  There are a whole bunch of places
> that use pg_do_encoding_conversion() to convert from the database encoding
> to encoding X (most usually UTF8), and right now if you do that in a
> SQL_ASCII database you have no assurance whatever that what is produced
> is actually valid in encoding X.  I think we need to close that loophole.

BTW, a second look confirms that all but one or two of the callers of
pg_do_encoding_conversion() are in fact converting either to or from
the database encoding.  That probably means they ought to be using
pg_any_to_server or pg_server_to_any instead, so that they can take
advantage of the pre-cached default conversion in the not-unlikely
situation that the other encoding is the current client_encoding.

This would imply that we ought to put a validate-if-source-is-SQL_ASCII
step into pg_server_to_any() as well, since that function currently
short-circuits calling pg_do_encoding_conversion() in that case.
        regards, tom lane



Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

От
Tom Lane
Дата:
I wrote:
>> The minimum-refactoring solution to this would be to tweak
>> pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
>> the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

>> I'm not sure if this would break anything we need to have work,
>> though.  Thoughts?  Do we want to back-patch such a change?

> I looked through all the callers of pg_do_encoding_conversion(), and
> AFAICS this change is a good idea.  There are a whole bunch of places
> that use pg_do_encoding_conversion() to convert from the database encoding
> to encoding X (most usually UTF8), and right now if you do that in a
> SQL_ASCII database you have no assurance whatever that what is produced
> is actually valid in encoding X.  I think we need to close that loophole.

The more I looked into mbutils.c, the less happy I got.  The attached
proposed patch takes care of the missing-verification hole in
pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid
of what I believe to be obsolete provisions in pg_do_encoding_conversion
to "work" if called outside a transaction --- if you consider it working
to completely fail to honor its API contract.  That should no longer be
necessary now that we perform client<->server encoding conversions via
perform_default_encoding_conversion rather than here.

For testing I inserted an "Assert(IsTransactionState());" at the top of
pg_do_encoding_conversion(), and couldn't trigger it, so I'm fairly sure
this change is OK.  Still, back-patching it might not be a good thing.
On the other hand, if there is still any code path that can get here
outside a transaction, we probably ought to find out rather than allow
completely bogus data to be returned.

I also made some more-cosmetic improvements, notably removing a bunch
of Asserts that are certainly dead code because the relevant variables
are never NULL.

I've not done anything yet about simplifying unnecessary calls of
pg_do_encoding_conversion into pg_server_to_any/pg_any_to_server.

How much of this is back-patch material, do you think?

            regards, tom lane

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 08440e9..7f43cae 100644
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 1,10 ****
! /*
!  * This file contains public functions for conversion between
!  * client encoding and server (database) encoding.
   *
!  * Tatsuo Ishii
   *
!  * src/backend/utils/mb/mbutils.c
   */
  #include "postgres.h"

--- 1,36 ----
! /*-------------------------------------------------------------------------
   *
!  * mbutils.c
!  *      This file contains functions for encoding conversion.
   *
!  * The string-conversion functions in this file share some API quirks.
!  * Note the following:
!  *
!  * The functions return a palloc'd, null-terminated string if conversion
!  * is required.  However, if no conversion is performed, the given source
!  * string pointer is returned as-is.
!  *
!  * Although the presence of a length argument means that callers can pass
!  * non-null-terminated strings, care is required because the same string
!  * will be passed back if no conversion occurs.  Such callers *must* check
!  * whether result == src and handle that case differently.
!  *
!  * If the source and destination encodings are the same, the source string
!  * is returned without any verification; it's assumed to be valid data.
!  * If that might not be the case, the caller is responsible for validating
!  * the string using a separate call to pg_verify_mbstr().  Whenever the
!  * source and destination encodings are different, the functions ensure that
!  * the result is validly encoded according to the destination encoding.
!  *
!  *
!  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
!  * Portions Copyright (c) 1994, Regents of the University of California
!  *
!  *
!  * IDENTIFICATION
!  *      src/backend/utils/mb/mbutils.c
!  *
!  *-------------------------------------------------------------------------
   */
  #include "postgres.h"

*************** InitializeClientEncoding(void)
*** 290,296 ****
  int
  pg_get_client_encoding(void)
  {
-     Assert(ClientEncoding);
      return ClientEncoding->encoding;
  }

--- 316,321 ----
*************** pg_get_client_encoding(void)
*** 300,328 ****
  const char *
  pg_get_client_encoding_name(void)
  {
-     Assert(ClientEncoding);
      return ClientEncoding->name;
  }

  /*
!  * Apply encoding conversion on src and return it. The encoding
!  * conversion function is chosen from the pg_conversion system catalog
!  * marked as "default". If it is not found in the schema search path,
!  * it's taken from pg_catalog schema. If it even is not in the schema,
!  * warn and return src.
!  *
!  * If conversion occurs, a palloc'd null-terminated string is returned.
!  * In the case of no conversion, src is returned.
!  *
!  * CAUTION: although the presence of a length argument means that callers
!  * can pass non-null-terminated strings, care is required because the same
!  * string will be passed back if no conversion occurs.    Such callers *must*
!  * check whether result == src and handle that case differently.
   *
!  * Note: we try to avoid raising error, since that could get us into
!  * infinite recursion when this function is invoked during error message
!  * sending.  It should be OK to raise error for overlength strings though,
!  * since the recursion will come with a shorter message.
   */
  unsigned char *
  pg_do_encoding_conversion(unsigned char *src, int len,
--- 325,337 ----
  const char *
  pg_get_client_encoding_name(void)
  {
      return ClientEncoding->name;
  }

  /*
!  * Convert src string to another encoding (general case).
   *
!  * See the notes about string conversion functions at the top of this file.
   */
  unsigned char *
  pg_do_encoding_conversion(unsigned char *src, int len,
*************** pg_do_encoding_conversion(unsigned char
*** 331,369 ****
      unsigned char *result;
      Oid            proc;

!     if (!IsTransactionState())
!         return src;

      if (src_encoding == dest_encoding)
!         return src;

!     if (src_encoding == PG_SQL_ASCII || dest_encoding == PG_SQL_ASCII)
!         return src;

!     if (len <= 0)
          return src;

      proc = FindDefaultConversionProc(src_encoding, dest_encoding);
      if (!OidIsValid(proc))
!     {
!         ereport(LOG,
                  (errcode(ERRCODE_UNDEFINED_FUNCTION),
                   errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
                          pg_encoding_to_char(src_encoding),
                          pg_encoding_to_char(dest_encoding))));
-         return src;
-     }
-
-     /*
-      * XXX we should avoid throwing errors in OidFunctionCall. Otherwise we
-      * are going into infinite loop!  So we have to make sure that the
-      * function exists before calling OidFunctionCall.
-      */
-     if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(proc)))
-     {
-         elog(LOG, "cache lookup failed for function %u", proc);
-         return src;
-     }

      /*
       * Allocate space for conversion result, being wary of integer overflow
--- 340,371 ----
      unsigned char *result;
      Oid            proc;

!     if (len <= 0)
!         return src;                /* empty string is always valid */

      if (src_encoding == dest_encoding)
!         return src;                /* no conversion required, assume valid */

!     if (dest_encoding == PG_SQL_ASCII)
!         return src;                /* any string is valid in SQL_ASCII */

!     if (src_encoding == PG_SQL_ASCII)
!     {
!         /* No conversion is possible, but we must validate the result */
!         (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false);
          return src;
+     }
+
+     if (!IsTransactionState())    /* shouldn't happen */
+         elog(ERROR, "cannot perform encoding conversion outside a transaction");

      proc = FindDefaultConversionProc(src_encoding, dest_encoding);
      if (!OidIsValid(proc))
!         ereport(ERROR,
                  (errcode(ERRCODE_UNDEFINED_FUNCTION),
                   errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
                          pg_encoding_to_char(src_encoding),
                          pg_encoding_to_char(dest_encoding))));

      /*
       * Allocate space for conversion result, being wary of integer overflow
*************** pg_do_encoding_conversion(unsigned char
*** 387,393 ****
  }

  /*
!  * Convert string using encoding_name. The source
   * encoding is the DB encoding.
   *
   * BYTEA convert_to(TEXT string, NAME encoding_name) */
--- 389,395 ----
  }

  /*
!  * Convert string to encoding encoding_name. The source
   * encoding is the DB encoding.
   *
   * BYTEA convert_to(TEXT string, NAME encoding_name) */
*************** pg_convert_to(PG_FUNCTION_ARGS)
*** 412,418 ****
  }

  /*
!  * Convert string using encoding_name. The destination
   * encoding is the DB encoding.
   *
   * TEXT convert_from(BYTEA string, NAME encoding_name) */
--- 414,420 ----
  }

  /*
!  * Convert string from encoding encoding_name. The destination
   * encoding is the DB encoding.
   *
   * TEXT convert_from(BYTEA string, NAME encoding_name) */
*************** pg_convert_from(PG_FUNCTION_ARGS)
*** 439,445 ****
  }

  /*
!  * Convert string using encoding_names.
   *
   * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
   */
--- 441,447 ----
  }

  /*
!  * Convert string between two arbitrary encodings.
   *
   * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
   */
*************** pg_convert(PG_FUNCTION_ARGS)
*** 472,479 ****
      src_str = VARDATA_ANY(string);
      pg_verify_mbstr_len(src_encoding, src_str, len, false);

!     dest_str = (char *) pg_do_encoding_conversion(
!                 (unsigned char *) src_str, len, src_encoding, dest_encoding);
      if (dest_str != src_str)
          len = strlen(dest_str);

--- 474,486 ----
      src_str = VARDATA_ANY(string);
      pg_verify_mbstr_len(src_encoding, src_str, len, false);

!     /* perform conversion */
!     dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str,
!                                                   len,
!                                                   src_encoding,
!                                                   dest_encoding);
!
!     /* update len if conversion actually happened */
      if (dest_str != src_str)
          len = strlen(dest_str);

*************** pg_convert(PG_FUNCTION_ARGS)
*** 503,512 ****
  Datum
  length_in_encoding(PG_FUNCTION_ARGS)
  {
!     bytea       *string = PG_GETARG_BYTEA_P(0);
      char       *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
      int            src_encoding = pg_char_to_encoding(src_encoding_name);
!     int            len = VARSIZE(string) - VARHDRSZ;
      int            retval;

      if (src_encoding < 0)
--- 510,520 ----
  Datum
  length_in_encoding(PG_FUNCTION_ARGS)
  {
!     bytea       *string = PG_GETARG_BYTEA_PP(0);
      char       *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
      int            src_encoding = pg_char_to_encoding(src_encoding_name);
!     const char *src_str;
!     int            len;
      int            retval;

      if (src_encoding < 0)
*************** length_in_encoding(PG_FUNCTION_ARGS)
*** 515,525 ****
                   errmsg("invalid encoding name \"%s\"",
                          src_encoding_name)));

!     retval = pg_verify_mbstr_len(src_encoding, VARDATA(string), len, false);
!     PG_RETURN_INT32(retval);

  }

  Datum
  pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
  {
--- 523,541 ----
                   errmsg("invalid encoding name \"%s\"",
                          src_encoding_name)));

!     len = VARSIZE_ANY_EXHDR(string);
!     src_str = VARDATA_ANY(string);

+     retval = pg_verify_mbstr_len(src_encoding, src_str, len, false);
+
+     PG_RETURN_INT32(retval);
  }

+ /*
+  * Get maximum multibyte character length in the specified encoding.
+  *
+  * Note encoding is specified numerically, not by name as above.
+  */
  Datum
  pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
  {
*************** pg_encoding_max_length_sql(PG_FUNCTION_A
*** 532,558 ****
  }

  /*
!  * convert client encoding to server encoding.
   */
  char *
  pg_client_to_server(const char *s, int len)
  {
-     Assert(ClientEncoding);
-
      return pg_any_to_server(s, len, ClientEncoding->encoding);
  }

  /*
!  * convert any encoding to server encoding.
   */
  char *
  pg_any_to_server(const char *s, int len, int encoding)
  {
-     Assert(DatabaseEncoding);
-     Assert(ClientEncoding);
-
      if (len <= 0)
!         return (char *) s;

      if (encoding == DatabaseEncoding->encoding ||
          encoding == PG_SQL_ASCII)
--- 548,578 ----
  }

  /*
!  * Convert client encoding to server encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_client_to_server(const char *s, int len)
  {
      return pg_any_to_server(s, len, ClientEncoding->encoding);
  }

  /*
!  * Convert any encoding to server encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
!  *
!  * Unlike the other string conversion functions, this will apply validation
!  * even if encoding == DatabaseEncoding->encoding.    This is because this is
!  * used to process data coming in from outside the database, and we never
!  * want to just assume validity.
   */
  char *
  pg_any_to_server(const char *s, int len, int encoding)
  {
      if (len <= 0)
!         return (char *) s;        /* empty string is always valid */

      if (encoding == DatabaseEncoding->encoding ||
          encoding == PG_SQL_ASCII)
*************** pg_any_to_server(const char *s, int len,
*** 594,639 ****
          return (char *) s;
      }

!     if (ClientEncoding->encoding == encoding)
          return perform_default_encoding_conversion(s, len, true);
!     else
!         return (char *) pg_do_encoding_conversion(
!              (unsigned char *) s, len, encoding, DatabaseEncoding->encoding);
  }

  /*
!  * convert server encoding to client encoding.
   */
  char *
  pg_server_to_client(const char *s, int len)
  {
-     Assert(ClientEncoding);
-
      return pg_server_to_any(s, len, ClientEncoding->encoding);
  }

  /*
!  * convert server encoding to any encoding.
   */
  char *
  pg_server_to_any(const char *s, int len, int encoding)
  {
-     Assert(DatabaseEncoding);
-     Assert(ClientEncoding);
-
      if (len <= 0)
!         return (char *) s;

      if (encoding == DatabaseEncoding->encoding ||
!         encoding == PG_SQL_ASCII ||
!         DatabaseEncoding->encoding == PG_SQL_ASCII)
          return (char *) s;        /* assume data is valid */

!     if (ClientEncoding->encoding == encoding)
          return perform_default_encoding_conversion(s, len, false);
!     else
!         return (char *) pg_do_encoding_conversion(
!              (unsigned char *) s, len, DatabaseEncoding->encoding, encoding);
  }

  /*
--- 614,672 ----
          return (char *) s;
      }

!     /* Fast path if we can use cached conversion function */
!     if (encoding == ClientEncoding->encoding)
          return perform_default_encoding_conversion(s, len, true);
!
!     /* General case ... will not work outside transactions */
!     return (char *) pg_do_encoding_conversion((unsigned char *) s,
!                                               len,
!                                               encoding,
!                                               DatabaseEncoding->encoding);
  }

  /*
!  * Convert server encoding to client encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_server_to_client(const char *s, int len)
  {
      return pg_server_to_any(s, len, ClientEncoding->encoding);
  }

  /*
!  * Convert server encoding to any encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_server_to_any(const char *s, int len, int encoding)
  {
      if (len <= 0)
!         return (char *) s;        /* empty string is always valid */

      if (encoding == DatabaseEncoding->encoding ||
!         encoding == PG_SQL_ASCII)
          return (char *) s;        /* assume data is valid */

!     if (DatabaseEncoding->encoding == PG_SQL_ASCII)
!     {
!         /* No conversion is possible, but we must validate the result */
!         (void) pg_verify_mbstr(encoding, s, len, false);
!         return (char *) s;
!     }
!
!     /* Fast path if we can use cached conversion function */
!     if (encoding == ClientEncoding->encoding)
          return perform_default_encoding_conversion(s, len, false);
!
!     /* General case ... will not work outside transactions */
!     return (char *) pg_do_encoding_conversion((unsigned char *) s,
!                                               len,
!                                               DatabaseEncoding->encoding,
!                                               encoding);
  }

  /*
*************** pg_server_to_any(const char *s, int len,
*** 643,649 ****
   *    SetClientEncoding(), no conversion is performed.
   */
  static char *
! perform_default_encoding_conversion(const char *src, int len, bool is_client_to_server)
  {
      char       *result;
      int            src_encoding,
--- 676,683 ----
   *    SetClientEncoding(), no conversion is performed.
   */
  static char *
! perform_default_encoding_conversion(const char *src, int len,
!                                     bool is_client_to_server)
  {
      char       *result;
      int            src_encoding,
*************** raw_pg_bind_textdomain_codeset(const cha
*** 931,941 ****
   * On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
   * When that matches the database encoding, we don't need to do anything.  In
   * CREATE DATABASE, we enforce or trust that the locale's codeset matches the
!  * database encoding, except for the C locale.  (On Windows, we also permit a
   * discrepancy under the UTF8 encoding.)  For the C locale, explicitly bind
   * gettext to the right codeset.
   *
!  * On Windows, gettext defaults to the Windows ANSI code page.  This is a
   * convenient departure for software that passes the strings to Windows ANSI
   * APIs, but we don't do that.  Compel gettext to use database encoding or,
   * failing that, the LC_CTYPE encoding as it would on other platforms.
--- 965,975 ----
   * On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
   * When that matches the database encoding, we don't need to do anything.  In
   * CREATE DATABASE, we enforce or trust that the locale's codeset matches the
!  * database encoding, except for the C locale.    (On Windows, we also permit a
   * discrepancy under the UTF8 encoding.)  For the C locale, explicitly bind
   * gettext to the right codeset.
   *
!  * On Windows, gettext defaults to the Windows ANSI code page.    This is a
   * convenient departure for software that passes the strings to Windows ANSI
   * APIs, but we don't do that.  Compel gettext to use database encoding or,
   * failing that, the LC_CTYPE encoding as it would on other platforms.
*************** pg_bind_textdomain_codeset(const char *d
*** 980,1007 ****
  int
  GetDatabaseEncoding(void)
  {
-     Assert(DatabaseEncoding);
      return DatabaseEncoding->encoding;
  }

  const char *
  GetDatabaseEncodingName(void)
  {
-     Assert(DatabaseEncoding);
      return DatabaseEncoding->name;
  }

  Datum
  getdatabaseencoding(PG_FUNCTION_ARGS)
  {
-     Assert(DatabaseEncoding);
      return DirectFunctionCall1(namein, CStringGetDatum(DatabaseEncoding->name));
  }

  Datum
  pg_client_encoding(PG_FUNCTION_ARGS)
  {
-     Assert(ClientEncoding);
      return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name));
  }

--- 1014,1037 ----
*************** pg_client_encoding(PG_FUNCTION_ARGS)
*** 1014,1020 ****
  int
  GetMessageEncoding(void)
  {
-     Assert(MessageEncoding);
      return MessageEncoding->encoding;
  }

--- 1044,1049 ----

Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding

От
Noah Misch
Дата:
On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote:
> The more I looked into mbutils.c, the less happy I got.  The attached
> proposed patch takes care of the missing-verification hole in
> pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid
> of what I believe to be obsolete provisions in pg_do_encoding_conversion
> to "work" if called outside a transaction --- if you consider it working
> to completely fail to honor its API contract.  That should no longer be
> necessary now that we perform client<->server encoding conversions via
> perform_default_encoding_conversion rather than here.

I like these changes.  In particular, coping with the absence of a conversion
function by calling ereport(LOG) and returning the source string was wrong for
nearly every caller, but you'd need to try an encoding like MULE_INTERNAL to
notice the problem.  Good riddance.

> How much of this is back-patch material, do you think?

None of it.  While many of the failures to validate against a character
encoding are clear bugs, applications hum along in spite of such bugs and
break when we tighten the checks.  I don't see a concern to override that
here.  Folks who want the tighter checking have some workarounds available.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Noah Misch <noah@leadboat.com> writes:
> On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote:
>> How much of this is back-patch material, do you think?

> None of it.  While many of the failures to validate against a character
> encoding are clear bugs, applications hum along in spite of such bugs and
> break when we tighten the checks.  I don't see a concern to override that
> here.  Folks who want the tighter checking have some workarounds available.

That's certainly a reasonable position to take concerning the changes for
outside-a-transaction behavior.  However, I think there's a case to be
made for adding the additional pg_verify_mbstr() calls in the back
branches.  We've been promising since around 8.3 that invalidly encoded
data can't get into a database, and it's disturbing to find that there
are leaks in that.
        regards, tom lane



On Fri, Feb 21, 2014 at 05:20:06PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote:
> >> How much of this is back-patch material, do you think?
> 
> > None of it.  While many of the failures to validate against a character
> > encoding are clear bugs, applications hum along in spite of such bugs and
> > break when we tighten the checks.  I don't see a concern to override that
> > here.  Folks who want the tighter checking have some workarounds available.
> 
> That's certainly a reasonable position to take concerning the changes for
> outside-a-transaction behavior.  However, I think there's a case to be
> made for adding the additional pg_verify_mbstr() calls in the back
> branches.  We've been promising since around 8.3 that invalidly encoded
> data can't get into a database, and it's disturbing to find that there
> are leaks in that.

I had a dark corner of an app break from the 8.4-vintage change to make
E'abc\000def'::text raise an error rather than truncate the string.  The old
behavior was clearly wrong, but I was still glad the change arrived in a major
release; the truncation happened to be harmless for that app.  Adding
pg_verify_mbstr() calls creates a similar situation.

Anything broken by the outside-a-transaction behavior change will be in C
code.  I'll expect between zero and one reports of breakage from that, so
whether to back-patch that is more academic.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Noah Misch <noah@leadboat.com> writes:
> On Fri, Feb 21, 2014 at 05:20:06PM -0500, Tom Lane wrote:
>> ...  However, I think there's a case to be
>> made for adding the additional pg_verify_mbstr() calls in the back
>> branches.  We've been promising since around 8.3 that invalidly encoded
>> data can't get into a database, and it's disturbing to find that there
>> are leaks in that.

> I had a dark corner of an app break from the 8.4-vintage change to make
> E'abc\000def'::text raise an error rather than truncate the string.  The old
> behavior was clearly wrong, but I was still glad the change arrived in a major
> release; the truncation happened to be harmless for that app.  Adding
> pg_verify_mbstr() calls creates a similar situation.

Since I'm not hearing anybody else argue for a back-patch, I've committed
this in HEAD only.
        regards, tom lane