Обсуждение: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
От
digoal@126.com
Дата:
The following bug has been logged on the website: Bug reference: 9210 Logged by: digoal.zhou Email address: digoal@126.com PostgreSQL version: 9.3.2 Operating system: CentOS 6.4 Description: In PostgreSQL , in addition sql_ascii coded character set encoding does not check the legitimacy of the other characters are encoded legality checks . This has clear instructions in the PostgreSQL manual: Reference http://www.postgresql.org/docs/9.3/static/multibyte.html http://blog.163.com/digoal @ 126/blog/static/163877040201211281407682 / http://blog.163.com/digoal @ 126/blog/static/16387704020132150381348 / http://blog.163.com/digoal @ 126/blog/static/1638770402011718112210516 / For example, we are using a database server UTF8 character set , then the time stored UTF8 database checks whether the agreement , if it does not it will error . For example: ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1 0x22 It would appear that illegal characters should not be stored in the database, but recently doing some data migration discovered this problem , the two databases on two servers, operating systems consistent LANG, are UTF8, database the characters are also UTF8, but after export data to another database into some of the data that is reported such an error , and therefore can not normally migrate some data . >From the face of it , it should be stored in the database of the illegal character . Then since there are checks , but also how to store into it? I used here convert_from this function to restore this phenomenon. We see , in addition to sql_ascii will convert character sets. postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'gbk') as g (t);  t | t ---- + ----------  He | \ xe79b8d (1 row) postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'utf8') as g (t); ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1 But truthfully use sql_ascii is stored byte stream , without any conversion. postgres = # select t, t :: bytea from convert_from ('\ xeec1', 'sql_ascii') as g (t);  t | t --- + --------    | \ Xeec1 (1 row) Using this method, the illegal byte stream into the database . postgres = # create table test (info text); CREATE TABLE postgres = # insert into test values ââ(convert_from ('\ xeec1', 'sql_ascii')); INSERT 0 1 postgres = # select info, info :: bytea from test;  info | info ------ + --------       | \ Xeec1 (1 row) This record came out , then the backup is not restored , as follows: postgres @ db-192-168-173-55-> pg_dump-t test-a | psql-f - SET SET SET SET SET SET psql: <stdin>: 19: ERROR: invalid byte sequence for encoding "UTF8": 0xee 0xc1 CONTEXT: COPY test, line 1 Can be used directly in the database, for example : postgres = # insert into test select * from test; INSERT 0 1 postgres = # select info, info :: bytea from test;  info | info ------ + --------       | \ Xeec1       | \ Xeec1 (2 rows) This problem makes the backup data reduction becomes a hassle . There may be other ways to put the data into the database illegally , which resulted in the present situation . If you want to import the data across databases currently available through DBLINK or external table to import the illegal character of the problem the way to use the backup to restore occurs . [ References ] 1. Src / backend / utils / mb / mbutils.c 2. Src / backend / utils / mb / wchar.c 3. Postgres = # \ df convert *                               List of functions    Schema | Name | Result data type | Argument data types | Type ------------ + -------------- + ------------------ + --- ------------------ + --------  pg_catalog | convert | bytea | bytea, name, name | normal  pg_catalog | convert_from | text | bytea, name | normal  pg_catalog | convert_to | bytea | text, name | normal (3 rows)
digoal@126.com writes: > select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t); > [ fails to check that string is valid in database encoding ] Hm, yeah. Normal input to the database goes through pg_any_to_server(), which will apply a validation step if the source encoding is SQL_ASCII and the destination encoding is something else. However, pg_convert and some other places call pg_do_encoding_conversion() directly, and that function will just quietly do nothing if either encoding is SQL_ASCII. The minimum-refactoring solution to this would be to tweak pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing. I'm not sure if this would break anything we need to have work, though. Thoughts? Do we want to back-patch such a change? regards, tom lane
HI,
Thanks very much.
We use dblink or foreign table migrate datas instead pg_dump now resolve the error data load problem.
--
公益是一辈子的事,I'm Digoal,Just Do It.
公益是一辈子的事,I'm Digoal,Just Do It.
At 2014-02-14 04:49:08,"Tom Lane" <tgl@sss.pgh.pa.us> wrote: >digoal@126.com writes: >> select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t); >> [ fails to check that string is valid in database encoding ] > >Hm, yeah. Normal input to the database goes through pg_any_to_server(), >which will apply a validation step if the source encoding is SQL_ASCII >and the destination encoding is something else. However, pg_convert and >some other places call pg_do_encoding_conversion() directly, and that >function will just quietly do nothing if either encoding is SQL_ASCII. > >The minimum-refactoring solution to this would be to tweak >pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but >the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing. > >I'm not sure if this would break anything we need to have work, >though. Thoughts? Do we want to back-patch such a change? > > regards, tom lane
I wrote: > digoal@126.com writes: >> select t, t::bytea from convert_from('\xeec1', 'sql_ascii') as g(t); >> [ fails to check that string is valid in database encoding ] > Hm, yeah. Normal input to the database goes through pg_any_to_server(), > which will apply a validation step if the source encoding is SQL_ASCII > and the destination encoding is something else. However, pg_convert and > some other places call pg_do_encoding_conversion() directly, and that > function will just quietly do nothing if either encoding is SQL_ASCII. > The minimum-refactoring solution to this would be to tweak > pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but > the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing. > I'm not sure if this would break anything we need to have work, > though. Thoughts? Do we want to back-patch such a change? I looked through all the callers of pg_do_encoding_conversion(), and AFAICS this change is a good idea. There are a whole bunch of places that use pg_do_encoding_conversion() to convert from the database encoding to encoding X (most usually UTF8), and right now if you do that in a SQL_ASCII database you have no assurance whatever that what is produced is actually valid in encoding X. I think we need to close that loophole. I found one place --- utf_u2e() in plperl_helpers.h --- that is aware of the lack of checking and forces a pg_verify_mbstr call for itself; but it apparently is concerned about whether the source data is actually utf8 in the first place, which I think is not really pg_do_encoding_conversion's bailiwick. I'm okay with pg_do_encoding_conversion being a no-op if src_encoding == dest_encoding. Barring objections, I will fix and back-patch this. regards, tom lane
I wrote: > I looked through all the callers of pg_do_encoding_conversion(), and > AFAICS this change is a good idea. There are a whole bunch of places > that use pg_do_encoding_conversion() to convert from the database encoding > to encoding X (most usually UTF8), and right now if you do that in a > SQL_ASCII database you have no assurance whatever that what is produced > is actually valid in encoding X. I think we need to close that loophole. BTW, a second look confirms that all but one or two of the callers of pg_do_encoding_conversion() are in fact converting either to or from the database encoding. That probably means they ought to be using pg_any_to_server or pg_server_to_any instead, so that they can take advantage of the pre-cached default conversion in the not-unlikely situation that the other encoding is the current client_encoding. This would imply that we ought to put a validate-if-source-is-SQL_ASCII step into pg_server_to_any() as well, since that function currently short-circuits calling pg_do_encoding_conversion() in that case. regards, tom lane
I wrote: >> The minimum-refactoring solution to this would be to tweak >> pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but >> the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing. >> I'm not sure if this would break anything we need to have work, >> though. Thoughts? Do we want to back-patch such a change? > I looked through all the callers of pg_do_encoding_conversion(), and > AFAICS this change is a good idea. There are a whole bunch of places > that use pg_do_encoding_conversion() to convert from the database encoding > to encoding X (most usually UTF8), and right now if you do that in a > SQL_ASCII database you have no assurance whatever that what is produced > is actually valid in encoding X. I think we need to close that loophole. The more I looked into mbutils.c, the less happy I got. The attached proposed patch takes care of the missing-verification hole in pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid of what I believe to be obsolete provisions in pg_do_encoding_conversion to "work" if called outside a transaction --- if you consider it working to completely fail to honor its API contract. That should no longer be necessary now that we perform client<->server encoding conversions via perform_default_encoding_conversion rather than here. For testing I inserted an "Assert(IsTransactionState());" at the top of pg_do_encoding_conversion(), and couldn't trigger it, so I'm fairly sure this change is OK. Still, back-patching it might not be a good thing. On the other hand, if there is still any code path that can get here outside a transaction, we probably ought to find out rather than allow completely bogus data to be returned. I also made some more-cosmetic improvements, notably removing a bunch of Asserts that are certainly dead code because the relevant variables are never NULL. I've not done anything yet about simplifying unnecessary calls of pg_do_encoding_conversion into pg_server_to_any/pg_any_to_server. How much of this is back-patch material, do you think? regards, tom lane diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index 08440e9..7f43cae 100644 *** a/src/backend/utils/mb/mbutils.c --- b/src/backend/utils/mb/mbutils.c *************** *** 1,10 **** ! /* ! * This file contains public functions for conversion between ! * client encoding and server (database) encoding. * ! * Tatsuo Ishii * ! * src/backend/utils/mb/mbutils.c */ #include "postgres.h" --- 1,36 ---- ! /*------------------------------------------------------------------------- * ! * mbutils.c ! * This file contains functions for encoding conversion. * ! * The string-conversion functions in this file share some API quirks. ! * Note the following: ! * ! * The functions return a palloc'd, null-terminated string if conversion ! * is required. However, if no conversion is performed, the given source ! * string pointer is returned as-is. ! * ! * Although the presence of a length argument means that callers can pass ! * non-null-terminated strings, care is required because the same string ! * will be passed back if no conversion occurs. Such callers *must* check ! * whether result == src and handle that case differently. ! * ! * If the source and destination encodings are the same, the source string ! * is returned without any verification; it's assumed to be valid data. ! * If that might not be the case, the caller is responsible for validating ! * the string using a separate call to pg_verify_mbstr(). Whenever the ! * source and destination encodings are different, the functions ensure that ! * the result is validly encoded according to the destination encoding. ! * ! * ! * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group ! * Portions Copyright (c) 1994, Regents of the University of California ! * ! * ! * IDENTIFICATION ! * src/backend/utils/mb/mbutils.c ! * ! *------------------------------------------------------------------------- */ #include "postgres.h" *************** InitializeClientEncoding(void) *** 290,296 **** int pg_get_client_encoding(void) { - Assert(ClientEncoding); return ClientEncoding->encoding; } --- 316,321 ---- *************** pg_get_client_encoding(void) *** 300,328 **** const char * pg_get_client_encoding_name(void) { - Assert(ClientEncoding); return ClientEncoding->name; } /* ! * Apply encoding conversion on src and return it. The encoding ! * conversion function is chosen from the pg_conversion system catalog ! * marked as "default". If it is not found in the schema search path, ! * it's taken from pg_catalog schema. If it even is not in the schema, ! * warn and return src. ! * ! * If conversion occurs, a palloc'd null-terminated string is returned. ! * In the case of no conversion, src is returned. ! * ! * CAUTION: although the presence of a length argument means that callers ! * can pass non-null-terminated strings, care is required because the same ! * string will be passed back if no conversion occurs. Such callers *must* ! * check whether result == src and handle that case differently. * ! * Note: we try to avoid raising error, since that could get us into ! * infinite recursion when this function is invoked during error message ! * sending. It should be OK to raise error for overlength strings though, ! * since the recursion will come with a shorter message. */ unsigned char * pg_do_encoding_conversion(unsigned char *src, int len, --- 325,337 ---- const char * pg_get_client_encoding_name(void) { return ClientEncoding->name; } /* ! * Convert src string to another encoding (general case). * ! * See the notes about string conversion functions at the top of this file. */ unsigned char * pg_do_encoding_conversion(unsigned char *src, int len, *************** pg_do_encoding_conversion(unsigned char *** 331,369 **** unsigned char *result; Oid proc; ! if (!IsTransactionState()) ! return src; if (src_encoding == dest_encoding) ! return src; ! if (src_encoding == PG_SQL_ASCII || dest_encoding == PG_SQL_ASCII) ! return src; ! if (len <= 0) return src; proc = FindDefaultConversionProc(src_encoding, dest_encoding); if (!OidIsValid(proc)) ! { ! ereport(LOG, (errcode(ERRCODE_UNDEFINED_FUNCTION), errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist", pg_encoding_to_char(src_encoding), pg_encoding_to_char(dest_encoding)))); - return src; - } - - /* - * XXX we should avoid throwing errors in OidFunctionCall. Otherwise we - * are going into infinite loop! So we have to make sure that the - * function exists before calling OidFunctionCall. - */ - if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(proc))) - { - elog(LOG, "cache lookup failed for function %u", proc); - return src; - } /* * Allocate space for conversion result, being wary of integer overflow --- 340,371 ---- unsigned char *result; Oid proc; ! if (len <= 0) ! return src; /* empty string is always valid */ if (src_encoding == dest_encoding) ! return src; /* no conversion required, assume valid */ ! if (dest_encoding == PG_SQL_ASCII) ! return src; /* any string is valid in SQL_ASCII */ ! if (src_encoding == PG_SQL_ASCII) ! { ! /* No conversion is possible, but we must validate the result */ ! (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false); return src; + } + + if (!IsTransactionState()) /* shouldn't happen */ + elog(ERROR, "cannot perform encoding conversion outside a transaction"); proc = FindDefaultConversionProc(src_encoding, dest_encoding); if (!OidIsValid(proc)) ! ereport(ERROR, (errcode(ERRCODE_UNDEFINED_FUNCTION), errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist", pg_encoding_to_char(src_encoding), pg_encoding_to_char(dest_encoding)))); /* * Allocate space for conversion result, being wary of integer overflow *************** pg_do_encoding_conversion(unsigned char *** 387,393 **** } /* ! * Convert string using encoding_name. The source * encoding is the DB encoding. * * BYTEA convert_to(TEXT string, NAME encoding_name) */ --- 389,395 ---- } /* ! * Convert string to encoding encoding_name. The source * encoding is the DB encoding. * * BYTEA convert_to(TEXT string, NAME encoding_name) */ *************** pg_convert_to(PG_FUNCTION_ARGS) *** 412,418 **** } /* ! * Convert string using encoding_name. The destination * encoding is the DB encoding. * * TEXT convert_from(BYTEA string, NAME encoding_name) */ --- 414,420 ---- } /* ! * Convert string from encoding encoding_name. The destination * encoding is the DB encoding. * * TEXT convert_from(BYTEA string, NAME encoding_name) */ *************** pg_convert_from(PG_FUNCTION_ARGS) *** 439,445 **** } /* ! * Convert string using encoding_names. * * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name) */ --- 441,447 ---- } /* ! * Convert string between two arbitrary encodings. * * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name) */ *************** pg_convert(PG_FUNCTION_ARGS) *** 472,479 **** src_str = VARDATA_ANY(string); pg_verify_mbstr_len(src_encoding, src_str, len, false); ! dest_str = (char *) pg_do_encoding_conversion( ! (unsigned char *) src_str, len, src_encoding, dest_encoding); if (dest_str != src_str) len = strlen(dest_str); --- 474,486 ---- src_str = VARDATA_ANY(string); pg_verify_mbstr_len(src_encoding, src_str, len, false); ! /* perform conversion */ ! dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str, ! len, ! src_encoding, ! dest_encoding); ! ! /* update len if conversion actually happened */ if (dest_str != src_str) len = strlen(dest_str); *************** pg_convert(PG_FUNCTION_ARGS) *** 503,512 **** Datum length_in_encoding(PG_FUNCTION_ARGS) { ! bytea *string = PG_GETARG_BYTEA_P(0); char *src_encoding_name = NameStr(*PG_GETARG_NAME(1)); int src_encoding = pg_char_to_encoding(src_encoding_name); ! int len = VARSIZE(string) - VARHDRSZ; int retval; if (src_encoding < 0) --- 510,520 ---- Datum length_in_encoding(PG_FUNCTION_ARGS) { ! bytea *string = PG_GETARG_BYTEA_PP(0); char *src_encoding_name = NameStr(*PG_GETARG_NAME(1)); int src_encoding = pg_char_to_encoding(src_encoding_name); ! const char *src_str; ! int len; int retval; if (src_encoding < 0) *************** length_in_encoding(PG_FUNCTION_ARGS) *** 515,525 **** errmsg("invalid encoding name \"%s\"", src_encoding_name))); ! retval = pg_verify_mbstr_len(src_encoding, VARDATA(string), len, false); ! PG_RETURN_INT32(retval); } Datum pg_encoding_max_length_sql(PG_FUNCTION_ARGS) { --- 523,541 ---- errmsg("invalid encoding name \"%s\"", src_encoding_name))); ! len = VARSIZE_ANY_EXHDR(string); ! src_str = VARDATA_ANY(string); + retval = pg_verify_mbstr_len(src_encoding, src_str, len, false); + + PG_RETURN_INT32(retval); } + /* + * Get maximum multibyte character length in the specified encoding. + * + * Note encoding is specified numerically, not by name as above. + */ Datum pg_encoding_max_length_sql(PG_FUNCTION_ARGS) { *************** pg_encoding_max_length_sql(PG_FUNCTION_A *** 532,558 **** } /* ! * convert client encoding to server encoding. */ char * pg_client_to_server(const char *s, int len) { - Assert(ClientEncoding); - return pg_any_to_server(s, len, ClientEncoding->encoding); } /* ! * convert any encoding to server encoding. */ char * pg_any_to_server(const char *s, int len, int encoding) { - Assert(DatabaseEncoding); - Assert(ClientEncoding); - if (len <= 0) ! return (char *) s; if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII) --- 548,578 ---- } /* ! * Convert client encoding to server encoding. ! * ! * See the notes about string conversion functions at the top of this file. */ char * pg_client_to_server(const char *s, int len) { return pg_any_to_server(s, len, ClientEncoding->encoding); } /* ! * Convert any encoding to server encoding. ! * ! * See the notes about string conversion functions at the top of this file. ! * ! * Unlike the other string conversion functions, this will apply validation ! * even if encoding == DatabaseEncoding->encoding. This is because this is ! * used to process data coming in from outside the database, and we never ! * want to just assume validity. */ char * pg_any_to_server(const char *s, int len, int encoding) { if (len <= 0) ! return (char *) s; /* empty string is always valid */ if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII) *************** pg_any_to_server(const char *s, int len, *** 594,639 **** return (char *) s; } ! if (ClientEncoding->encoding == encoding) return perform_default_encoding_conversion(s, len, true); ! else ! return (char *) pg_do_encoding_conversion( ! (unsigned char *) s, len, encoding, DatabaseEncoding->encoding); } /* ! * convert server encoding to client encoding. */ char * pg_server_to_client(const char *s, int len) { - Assert(ClientEncoding); - return pg_server_to_any(s, len, ClientEncoding->encoding); } /* ! * convert server encoding to any encoding. */ char * pg_server_to_any(const char *s, int len, int encoding) { - Assert(DatabaseEncoding); - Assert(ClientEncoding); - if (len <= 0) ! return (char *) s; if (encoding == DatabaseEncoding->encoding || ! encoding == PG_SQL_ASCII || ! DatabaseEncoding->encoding == PG_SQL_ASCII) return (char *) s; /* assume data is valid */ ! if (ClientEncoding->encoding == encoding) return perform_default_encoding_conversion(s, len, false); ! else ! return (char *) pg_do_encoding_conversion( ! (unsigned char *) s, len, DatabaseEncoding->encoding, encoding); } /* --- 614,672 ---- return (char *) s; } ! /* Fast path if we can use cached conversion function */ ! if (encoding == ClientEncoding->encoding) return perform_default_encoding_conversion(s, len, true); ! ! /* General case ... will not work outside transactions */ ! return (char *) pg_do_encoding_conversion((unsigned char *) s, ! len, ! encoding, ! DatabaseEncoding->encoding); } /* ! * Convert server encoding to client encoding. ! * ! * See the notes about string conversion functions at the top of this file. */ char * pg_server_to_client(const char *s, int len) { return pg_server_to_any(s, len, ClientEncoding->encoding); } /* ! * Convert server encoding to any encoding. ! * ! * See the notes about string conversion functions at the top of this file. */ char * pg_server_to_any(const char *s, int len, int encoding) { if (len <= 0) ! return (char *) s; /* empty string is always valid */ if (encoding == DatabaseEncoding->encoding || ! encoding == PG_SQL_ASCII) return (char *) s; /* assume data is valid */ ! if (DatabaseEncoding->encoding == PG_SQL_ASCII) ! { ! /* No conversion is possible, but we must validate the result */ ! (void) pg_verify_mbstr(encoding, s, len, false); ! return (char *) s; ! } ! ! /* Fast path if we can use cached conversion function */ ! if (encoding == ClientEncoding->encoding) return perform_default_encoding_conversion(s, len, false); ! ! /* General case ... will not work outside transactions */ ! return (char *) pg_do_encoding_conversion((unsigned char *) s, ! len, ! DatabaseEncoding->encoding, ! encoding); } /* *************** pg_server_to_any(const char *s, int len, *** 643,649 **** * SetClientEncoding(), no conversion is performed. */ static char * ! perform_default_encoding_conversion(const char *src, int len, bool is_client_to_server) { char *result; int src_encoding, --- 676,683 ---- * SetClientEncoding(), no conversion is performed. */ static char * ! perform_default_encoding_conversion(const char *src, int len, ! bool is_client_to_server) { char *result; int src_encoding, *************** raw_pg_bind_textdomain_codeset(const cha *** 931,941 **** * On most platforms, gettext defaults to the codeset implied by LC_CTYPE. * When that matches the database encoding, we don't need to do anything. In * CREATE DATABASE, we enforce or trust that the locale's codeset matches the ! * database encoding, except for the C locale. (On Windows, we also permit a * discrepancy under the UTF8 encoding.) For the C locale, explicitly bind * gettext to the right codeset. * ! * On Windows, gettext defaults to the Windows ANSI code page. This is a * convenient departure for software that passes the strings to Windows ANSI * APIs, but we don't do that. Compel gettext to use database encoding or, * failing that, the LC_CTYPE encoding as it would on other platforms. --- 965,975 ---- * On most platforms, gettext defaults to the codeset implied by LC_CTYPE. * When that matches the database encoding, we don't need to do anything. In * CREATE DATABASE, we enforce or trust that the locale's codeset matches the ! * database encoding, except for the C locale. (On Windows, we also permit a * discrepancy under the UTF8 encoding.) For the C locale, explicitly bind * gettext to the right codeset. * ! * On Windows, gettext defaults to the Windows ANSI code page. This is a * convenient departure for software that passes the strings to Windows ANSI * APIs, but we don't do that. Compel gettext to use database encoding or, * failing that, the LC_CTYPE encoding as it would on other platforms. *************** pg_bind_textdomain_codeset(const char *d *** 980,1007 **** int GetDatabaseEncoding(void) { - Assert(DatabaseEncoding); return DatabaseEncoding->encoding; } const char * GetDatabaseEncodingName(void) { - Assert(DatabaseEncoding); return DatabaseEncoding->name; } Datum getdatabaseencoding(PG_FUNCTION_ARGS) { - Assert(DatabaseEncoding); return DirectFunctionCall1(namein, CStringGetDatum(DatabaseEncoding->name)); } Datum pg_client_encoding(PG_FUNCTION_ARGS) { - Assert(ClientEncoding); return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name)); } --- 1014,1037 ---- *************** pg_client_encoding(PG_FUNCTION_ARGS) *** 1014,1020 **** int GetMessageEncoding(void) { - Assert(MessageEncoding); return MessageEncoding->encoding; } --- 1044,1049 ----
On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote: > The more I looked into mbutils.c, the less happy I got. The attached > proposed patch takes care of the missing-verification hole in > pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid > of what I believe to be obsolete provisions in pg_do_encoding_conversion > to "work" if called outside a transaction --- if you consider it working > to completely fail to honor its API contract. That should no longer be > necessary now that we perform client<->server encoding conversions via > perform_default_encoding_conversion rather than here. I like these changes. In particular, coping with the absence of a conversion function by calling ereport(LOG) and returning the source string was wrong for nearly every caller, but you'd need to try an encoding like MULE_INTERNAL to notice the problem. Good riddance. > How much of this is back-patch material, do you think? None of it. While many of the failures to validate against a character encoding are clear bugs, applications hum along in spite of such bugs and break when we tighten the checks. I don't see a concern to override that here. Folks who want the tighter checking have some workarounds available. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: > On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote: >> How much of this is back-patch material, do you think? > None of it. While many of the failures to validate against a character > encoding are clear bugs, applications hum along in spite of such bugs and > break when we tighten the checks. I don't see a concern to override that > here. Folks who want the tighter checking have some workarounds available. That's certainly a reasonable position to take concerning the changes for outside-a-transaction behavior. However, I think there's a case to be made for adding the additional pg_verify_mbstr() calls in the back branches. We've been promising since around 8.3 that invalidly encoded data can't get into a database, and it's disturbing to find that there are leaks in that. regards, tom lane
On Fri, Feb 21, 2014 at 05:20:06PM -0500, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Wed, Feb 19, 2014 at 08:22:13PM -0500, Tom Lane wrote: > >> How much of this is back-patch material, do you think? > > > None of it. While many of the failures to validate against a character > > encoding are clear bugs, applications hum along in spite of such bugs and > > break when we tighten the checks. I don't see a concern to override that > > here. Folks who want the tighter checking have some workarounds available. > > That's certainly a reasonable position to take concerning the changes for > outside-a-transaction behavior. However, I think there's a case to be > made for adding the additional pg_verify_mbstr() calls in the back > branches. We've been promising since around 8.3 that invalidly encoded > data can't get into a database, and it's disturbing to find that there > are leaks in that. I had a dark corner of an app break from the 8.4-vintage change to make E'abc\000def'::text raise an error rather than truncate the string. The old behavior was clearly wrong, but I was still glad the change arrived in a major release; the truncation happened to be harmless for that app. Adding pg_verify_mbstr() calls creates a similar situation. Anything broken by the outside-a-transaction behavior change will be in C code. I'll expect between zero and one reports of breakage from that, so whether to back-patch that is more academic. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: > On Fri, Feb 21, 2014 at 05:20:06PM -0500, Tom Lane wrote: >> ... However, I think there's a case to be >> made for adding the additional pg_verify_mbstr() calls in the back >> branches. We've been promising since around 8.3 that invalidly encoded >> data can't get into a database, and it's disturbing to find that there >> are leaks in that. > I had a dark corner of an app break from the 8.4-vintage change to make > E'abc\000def'::text raise an error rather than truncate the string. The old > behavior was clearly wrong, but I was still glad the change arrived in a major > release; the truncation happened to be harmless for that app. Adding > pg_verify_mbstr() calls creates a similar situation. Since I'm not hearing anybody else argue for a back-patch, I've committed this in HEAD only. regards, tom lane