Re: BUG #5221: errors converting string to utf8

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: BUG #5221: errors converting string to utf8
Дата
Msg-id 4B15141B.5050809@postnewspapers.com.au
обсуждение исходный текст
Ответ на BUG #5221: errors converting string to utf8  ("DM" <dmedri@ciseonweb.it>)
Ответы Re: BUG #5221: errors converting string to utf8  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
On 1/12/2009 7:36 PM, DM wrote:

> select convert('Vall\xe9e','LATIN1','UTF8');

You expect the result 'Vallée', right? As in the utf-8 bytes

   Vall\xc3\xa9e

?

> ERROR:  invalid byte sequence for encoding "UTF8": 0xe965
> HINT: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

The HINT tells you what's going on here.

The error isn't being raised by convert(...). You'll get the same error
if you just:

     select 'Vall\xe9e';

... because the problem is that the server is trying to treat that
literal as a string in your client_encoding, which is UTF-8, and convert
it to the server encoding.

You will note that the convert(...) function takes a 'bytea' argument
rather than 'text'. That's because PostgreSQL string literals are
expected to be valid text in the current encoding. Entering a string
literal with a byte sequence invalid in the current encoding will be
problematic. You need to supply a valid bytea value, which involves
juggling some interesting double-escaping and using octal escapes for
characters.

The octal escape for the byte 0xe9 is \351, so to correctly submit your
query to PostgreSQL you'd need to send:

   select convert(E'Vall\\351e','LATIN1','UTF8');

or if you really just want to convert to the server encoding:

   select convert_from(E'Vall\\351e','LATIN-1');

Here, E'' is used for the string to tell PostgreSQL to interpret
escapes, and \\351 is used as the octal escape for 0xe9. We double the
backslashes because the first backslash gets used up parsing the
literal, and we want the bytea value to actually be 'Val\351e' after
string literal parsing.

Instead, if standard_conforming_strings is on, you can omit the E'' and
use single backslashes.

Note that in 8.5 and above you could use E'Val\\xe9e' instead of the
octal form, as 8.5 understands hex escapes in bytea values.

I guess bytea is kind of ugly to use in a text-based protocol :-(
because the query string has to be valid text in the client_encoding, so
bytea values containing data not valid in the client encoding must be
escaped in the parsed query string. There's nothing much to be done
about it, though.

--
Craig Ringer

В списке pgsql-bugs по дате отправления:

Предыдущее
От: "DM"
Дата:
Сообщение: BUG #5221: errors converting string to utf8
Следующее
От: Tom Lane
Дата:
Сообщение: Re: BUG #5221: errors converting string to utf8