Обсуждение: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.

Поиск
Список
Период
Сортировка

BUG #3697: utf8 issue: can not reimport a table that was successfully exported.

От
"Marc Mamin"
Дата:
The following bug has been logged online:

Bug reference:      3697
Logged by:          Marc Mamin
Email address:      m.mamin@intershop.de
PostgreSQL version: 8.2.4
Operating system:   SuSE Linux 9.1 (i586)
Description:        utf8 issue: can not reimport  a table that was
successfully exported.
Details:

Hello,

I'm not sure this is a bug; the problem might be related to a client
encoding issue.


My Database is on a Linux server which I connect to using putty from
Windows.

both server and client are set to UTF8:

 client_encoding                 | UTF8
 backslash_quote                 | safe_encoding
 server_encoding                 | UTF8


I stumbled on this issue while trying to import a "malicious" user agent
string...

I didn't check if all characters are valid UTF8...


My concern is about database recovery.
I'm using pg_dump to regulary export my users, bu according to the example
below,
it seems that  my dumps may be worthless !





May be you should consider not to publish this before a fix exist
as this is a serious issue which could eventually be exploited
to damage existing instances (for the case this is really a bug)...

regards,

Marc Mamin



steps to repeat:


CREATE TABLE utf8_test(s varchar);


CREATE OR REPLACE FUNCTION f_utf8_test( st VARCHAR) RETURNS INT AS $$

   DECLARE
     quotedline varchar = quote_literal($1);

   BEGIN
     INSERT INTO utf8_test ( s ) VALUES (  quotedline);
     RETURN 0;
   END;

$$ LANGUAGE plpgsql;

select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
\xE3\xEE\xF1\xF3\xE4
xE4\xE6 \xCD\xC1 \xD0\xC1")');


-- here the same statement, but with all backslashed duplicated for the case
when the string was modified when posting this issue:


select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
\\xE0\\xF0\\xF1\\xF2\\xE2\\xE5\\xED\\xED\\xFB\\xE9 \\xE2\\xFB\\xF1\\xF8\\9
\\xE3\\xEE\\xF1\\xF3\\xE4
xE4\\xE6 \\xCD\\xC1 \\xD0\\xC1")');


   WARNING:  nonstandard use of escape in a string literal
   LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind

COPY utf8_test TO '/tmp/utf8_test.txt';

  COPY 1

COPY utf8_test FROM '/tmp/utf8_test.txt';

ERROR:  invalid byte sequence for encoding "UTF8": 0xd3ce
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY utf8_test, line 1


The same isuue occure when using pg_dump:


 pg_dump  -i -v  -p 5433  -Uisdb2  -tutf8_test > /tmp/utf8_dump

 pg_dump: server version: 8.2.4; pg_dump version: 8.2.1
 pg_dump: proceeding despite version mismatch


 psql -f"/tmp/utf8_dump"

  =>
 psql:/tmp/utf8_dump:40: ERROR:  invalid byte sequence for encoding "UTF8":
0xd3ce
 HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
 CONTEXT:  COPY utf8_test, line 1

Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.

От
Tom Lane
Дата:
"Marc Mamin" <m.mamin@intershop.de> writes:
> I didn't check if all characters are valid UTF8...

They aren't ...

> select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
> \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
> \xE3\xEE\xF1\xF3\xE4
> xE4\xE6 \xCD\xC1 \xD0\xC1")');

In 8.3 that will throw an error:

utf8=# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
utf8'# \xE3\xEE\xF1\xF3\xE4
utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")');
WARNING:  nonstandard use of escape in a string literal
LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
                           ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR:  invalid byte sequence for encoding "UTF8": 0xe0f0f1
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
utf8=#

However, since this behavior isn't backwards-compatible, there's not
much appetite for back-patching it.

I don't think this is a security issue --- if you don't quote
backslashes in untrusted input you'll have problems far worse
than this one.

            regards, tom lane

Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.

От
Tom Lane
Дата:
"Marc Mamin" <M.Mamin@intershop.de> writes:
> Is there a recommendation how to clean these data (I know where to
> search for them)

Usually people do it by running iconv with the delete-bad-data option
on a pg_dump file.

            regards, tom lane

Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.

От
"Marc Mamin"
Дата:
Thank you for your quick response,

> if you don't quote backslashes in untrusted input you'll have problems
far worse than this one

I do it now but not since by db is live...=20
So I probably have some invalid caraters in.=20
Is this an issue that must be fixed before I can upgrade to 8.3 ?
Is there a recommendation how to clean these data (I know where to
search for them)

Thanks,

Marc Mamin


=20

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]=20
Sent: Thursday, October 25, 2007 6:08 PM
To: Marc Mamin
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #3697: utf8 issue: can not reimport a table that
was successfully exported.=20

"Marc Mamin" <m.mamin@intershop.de> writes:
> I didn't check if all characters are valid UTF8...

They aren't ...

> select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
> \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
> \xE3\xEE\xF1\xF3\xE4
> xE4\xE6 \xCD\xC1 \xD0\xC1")');

In 8.3 that will throw an error:

utf8=3D# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9
utf8'# \xE3\xEE\xF1\xF3\xE4 utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")');
WARNING:  nonstandard use of escape in a string literal LINE 1: select
f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind
                           ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR:  invalid byte sequence for encoding "UTF8": 0xe0f0f1
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
utf8=3D#=20

However, since this behavior isn't backwards-compatible, there's not
much appetite for back-patching it.

I don't think this is a security issue --- if you don't quote
backslashes in untrusted input you'll have problems far worse than this
one.

            regards, tom lane