Обсуждение: Problem with langage encoding

Поиск
Список
Период
Сортировка

Problem with langage encoding

От
"Ronald Vyhmeister"
Дата:
I've inherited a database system (including source)...  It's Russian text
(which I don't speak, but I need to work on it).

On my screen it shows correctly, but in the database it shows a mess...
from trying with a web browser.  It appears that the data in the database is
encoded with ISO-8859-1 (my browser and the database viewed with PGAdmin
match when my browser has ISO-8859-1 encoding)....  and it shows correctly
on screen in my browser when I use the WIN1251 encoding...

The problem is that the database is UTF8, and it won't let me use the
convert_to function because it says that the characters don't exist... What
I need to do is to pull the UTF8 from the database, tell postgres that it's
8859-1, and then convert it to WIN1251... How?

Thank you for any assistance!

Ron




Re: Problem with langage encoding

От
Raymond O'Donnell
Дата:
On 26/11/2008 14:40, Ronald Vyhmeister wrote:

> The problem is that the database is UTF8, and it won't let me use the
> convert_to function because it says that the characters don't exist... What
> I need to do is to pull the UTF8 from the database, tell postgres that it's
> 8859-1, and then convert it to WIN1251... How?

If the DB is in UTF8, you ought to be able to issue

  set client_encoding to 'WIN1251'

after connection, and the conversion happens automatically. See:

  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483

HTH,

Ray.


------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------

Re: Problem with langage encoding

От
"Ronald Vyhmeister"
Дата:

> The problem is that the database is UTF8, and it won't let me use the
> convert_to function because it says that the characters don't exist...
What
> I need to do is to pull the UTF8 from the database, tell postgres that
it's
> 8859-1, and then convert it to WIN1251... How?

>If the DB is in UTF8, you ought to be able to issue

>  set client_encoding to 'WIN1251'

>after connection, and the conversion happens automatically. See:

>  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483

>HTH,

>Ray.

Wish it would work...  when I do it, I get:

ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
********** Error **********
ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
SQL state: 22P05

The DB is storing it UTF8, but it is really 8859-1...  I need to force PG to
think that it's 8859-1 in spite of what it thinks it may be...

Ron


Re: Problem with langage encoding

От
Tom Lane
Дата:
"Ronald Vyhmeister" <rvyhmeister@gmail.com> writes:
> The DB is storing it UTF8, but it is really 8859-1...  I need to force PG to
> think that it's 8859-1 in spite of what it thinks it may be...

Dump the database, change the "set client_encoding" command in the
resulting file, reload.

            regards, tom lane

Re: Problem with langage encoding

От
Craig Ringer
Дата:
Ronald Vyhmeister wrote:
>
>> The problem is that the database is UTF8, and it won't let me use the
>> convert_to function because it says that the characters don't exist...
> What
>> I need to do is to pull the UTF8 from the database, tell postgres that
> it's
>> 8859-1, and then convert it to WIN1251... How?
>
>> If the DB is in UTF8, you ought to be able to issue
>
>>  set client_encoding to 'WIN1251'
>
>> after connection, and the conversion happens automatically. See:
>
>>  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
>
>> HTH,
>
>> Ray.
>
> Wish it would work...  when I do it, I get:
>
> ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

Which it does not; that character is "쎽" (HANGUL SYLLABLE SSYEG) which
certainly isn't in WIN1251 or in latin-1 (ISO-8859-1). The byte sequence
for this character in UTF-8 is:

    0xec 0x8e 0xbd

When decoded as latin-1, those three bytes are interpreted as: "쎽" ...
which doesn't seem much more likely to be valid. Are you SURE it's
latin-1, not (say) Koi_r-8?



Python is a useful tool for investigating encoding problems. It draws a
strong distinction between true Unicode strings and byte strings. It
provides quality conversion routines that let you specify the encoding
of the byte string, and that'll throw an exception for nonsensical input.

This lets you take a byte string, decode it according to various
encodings, and see what you land up with. You can also test various
horrible encoding mangling schemes used by apps rather easily. The
python `encodings' module contains a list of all known encodings by all
aliases as `encodings.aliases.aliases'. Here's some code that takes an
input byte string and prints a table of all ways it can be interpreted
according to various encodings:

-------------------
import sys
import encodings

stupid_encodings = ['zlib_codec', 'uu_codec', 'hex_codec', 'bz2_codec']

bs = '\xc3\xbd'
for encoding in set(encodings.aliases.aliases.values()):
  sys.stdout.write(u"%20s: " % encoding)
  sys.stdout.flush()
  if encoding in stupid_encodings:
    continue
  try:
    print u"\"%6s\" (%6s)" % (bs.decode(encoding),
repr(bs.decode(encoding)))
  except UnicodeDecodeError,e:
    print u"         [INVALID]"
  except LookupError,e:
    print u"         [UNSUPPORTED]"

--------------------


... and here's the output for the data you mentioned:


--------------------
           bz2_codec:               cp1140: "    C¨" (u'C\xa8')
              rot_13: "    ý" (u'\xc3\xbd')
               cp932: "    テス" (u'\uff83\uff7d')
        euc_jisx0213: "     箪" (u'\u7baa')
               cp037: "    C¨" (u'C\xa8')
           hex_codec:                cp500: "    C¨" (u'C\xa8')
            uu_codec:            big5hkscs: "     羸" (u'\u7fb8')
                mbcs:          [UNSUPPORTED]
        euc_jis_2004: "     箪" (u'\u7baa')
        iso2022_jp_3:          [INVALID]
        iso2022_jp_2:          [INVALID]
        iso2022_jp_1:          [INVALID]
                 gbk: "     媒" (u'\u5a92')
     iso2022_jp_2004:          [INVALID]
        quopri_codec:          [INVALID]
               cp424: "    C¨" (u'C\xa8')
          iso2022_jp:          [INVALID]
         mac_iceland: "    √Ω" (u'\u221a\u03a9')
           hp_roman8: "    û§" (u'\xfb\xa7')
          iso2022_kr:          [INVALID]
              euc_kr: "     첵" (u'\uccb5')
              cp1254: "    ý" (u'\xc3\xbd')
              gb2312: "     媒" (u'\u5a92')
               cp850: "    Û" (u'\u251c\xa2')
           shift_jis: "    テス" (u'\uff83\uff7d')
               cp852: "    ├Ż" (u'\u251c\u017b')
               cp855: "    ├й" (u'\u251c\u0439')
           utf_16_le: "     뷃" (u'\ubdc3')
               cp857: "    Û" (u'\u251c\xa2')
               cp775: "    ├Į" (u'\u251c\u012e')
              cp1026: "    C¨" (u'C\xa8')
          mac_latin2: "    √Ĺ" (u'\u221a\u0139')
        mac_cyrillic: "    √љ" (u'\u221a\u0459')
        base64_codec: "      " (    '')
             ptcp154: "    ГҪ" (u'\u0413\u04aa')
              euc_jp: "     箪" (u'\u7baa')
                  hz:          [INVALID]
               utf_8: "     ý" (u'\xfd')
           mac_greek: "    ΟΫ" (u'\u039f\u03ab')
               utf_7:          [INVALID]
         mac_turkish: "    √Ω" (u'\u221a\u03a9')
               cp949: "     첵" (u'\uccb5')
          zlib_codec:                 big5: "     羸" (u'\u7fb8')
           iso8859_9: "    ý" (u'\xc3\xbd')
           iso8859_8:          [INVALID]
           iso8859_5: "    УН" (u'\u0423\u041d')
           iso8859_4: "    ÃŊ" (u'\xc3\u014a')
           iso8859_7: "    Γ½" (u'\u0393\xbd')
           iso8859_6:          [INVALID]
           iso8859_3:          [INVALID]
           iso8859_2: "    Ă˝" (u'\u0102\u02dd')
             gb18030: "     媒" (u'\u5a92')
      shift_jis_2004: "    テス" (u'\uff83\uff7d')
           mac_roman: "    √Ω" (u'\u221a\u03a9')
               cp950: "     羸" (u'\u7fb8')
              utf_16: "     뷃" (u'\ubdc3')
          iso8859_15: "    Ü" (u'\xc3\u0153')
          iso8859_14: "    ÃẄ" (u'\xc3\u1e84')
             tis_620: "    รฝ" (u'\u0e23\u0e1d')
          iso8859_16: "    Ăœ" (u'\u0102\u0153')
          iso8859_11: "    รฝ" (u'\u0e23\u0e1d')
          iso8859_10: "    Ã―" (u'\xc3\u2015')
              tactis:          [UNSUPPORTED]
               ascii:          [INVALID]
               cp869: "    ├Ξ" (u'\u251c\u039e')
               cp860: "    ├╜" (u'\u251c\u255c')
               cp861: "    ├╜" (u'\u251c\u255c')
               cp862: "    ├╜" (u'\u251c\u255c')
               cp863: "    ├╜" (u'\u251c\u255c')
               cp864: "    ﺃﺵ" (u'\ufe83\ufeb5')
               cp865: "    ├╜" (u'\u251c\u255c')
               cp866: "    ├╜" (u'\u251c\u255c')
      shift_jisx0213: "    テス" (u'\uff83\uff7d')
              cp1255: "    ֳ½" (u'\u05b3\xbd')
             latin_1: "    ý" (u'\xc3\xbd')
              cp1257: "    Ć½" (u'\u0106\xbd')
              cp1256: "    أ½" (u'\u0623\xbd')
              cp1251: "    ГЅ" (u'\u0413\u0405')
              cp1250: "    Ă˝" (u'\u0102\u02dd')
              cp1253: "    Γ½" (u'\u0393\xbd')
              cp1252: "    ý" (u'\xc3\xbd')
               cp437: "    ├╜" (u'\u251c\u255c')
              cp1258: "    Ă½" (u'\u0102\xbd')
          iso8859_13: "    Ć½" (u'\u0106\xbd')
              koi8_r: "    ц╫" (u'\u0446\u256b')
           utf_16_be: "     쎽" (u'\uc3bd')
               johab: "     칳" (u'\uce73')
      iso2022_jp_ext:          [INVALID]
--------------------


I don't see anything particularly striking there, but (a) this is a tiny
sample not a useful string, (b) I'm relying on my interpretation of what
the original byte string must've been to be correct, (c) Python's
encoding support isn't absolutely comprehensive, and (d) I'm assuming
the original inserting app only made a simple error in populating the
database, rather than doing something really exciting with its text.


Can you provide a more complete sample of the text in the DB? It'd help
if you could show the CAST(stringvar AS bytea) value of the strings,
which will provide the byte sequence the string is stored as in the
server encoding.

--
Craig Ringer

Re: Problem with langage encoding

От
Stephane Bortzmeyer
Дата:
On Thu, Nov 27, 2008 at 02:34:17AM +0900,
 Craig Ringer <craig@postnewspapers.com.au> wrote
 a message of 188 lines which said:

> > ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>
> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG)

No, I don't think so. I think that 0xc3bd is the binary value, and
0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
less surprising. It is in Latin-1 but not in WIN1251.

In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
the database was really in Latin-1.

Re: Problem with langage encoding

От
Craig Ringer
Дата:
Stephane Bortzmeyer wrote:
> On Thu, Nov 27, 2008 at 02:34:17AM +0900,
>  Craig Ringer <craig@postnewspapers.com.au> wrote
>  a message of 188 lines which said:
>
>>> ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG)
>
> No, I don't think so. I think that 0xc3bd is the binary value, and
> 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
> less surprising. It is in Latin-1 but not in WIN1251.

Yes, that would make a lot more sense. From there it's easy, see below:

> In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
> the database was really in Latin-1.

OK, so what we end up with is the following sequence (expressed in
Python, which is always handy for encoding conversions etc):

print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r")

to yield: Щ

Assuming that's right, what presumably happened is that the database was
initialized with UTF-8 and loaded with data it was told was in the
latin-1 encoding, but was actually in the  koi8_r encoding. Therefore,
utf-8 sequences were generated based on the interpretation of the bytes
for each koi8_r character as the latin-1 character for the same byte
value, so:

Input file: 0xfd (latin-1: ý, koi8_r: Щ)
       |
       | (input interpreted as latin-1)
       v
Database:   0xc3bd (utf-8: ý)

To recover the data you must reverse that process. Thankfully it's going
to be 100% reversible, ie no information has been lost.

To create a tiny test table for the following explanation and fix code I
just ran:

create table ss ( x text) ;
insert into ss (x) values (E'\xc3\xbd');

Now if I:

set client_encoding = "WIN1251";

I get:

test=> select * from ss;
ERROR:  character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"

just like you.

With client_encoding set as utf-8 (which is what my machine is) I get:

test=> select * from ss;
 x
---
 ý
(1 row)



PostgreSQL's convert() function:
http://www.postgresql.org/docs/current/static/functions-string.html

may be used now to transform your data. It doesn't assume any encoding
for the input string, unlike convert_from and convert_to, so you can use
a statement like this to convert your data:(where 'x' is the string of
mangled data to be converted):

select (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
          'koi8_r')
) from ss;

In other words: "Decode the utf-8 sequence in the input and map each
utf-8 code point to the corresponding latin-1 character, outputting one
byte per latin-1 character. Interpret the sequence of bytes just
produced as a sequence of characters in the koi8_r encoding, and map
them to the same characters in the database's internal encoding."

To copy the converted data to a new table:

CREATE TABLE converted (y text);
INSERT INTO converted(y)
SELECT (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
          'koi8_r')
) from ss;

Now if I SELECT from the table of converted data, I get the right(?) output:

test=> select * from converted;
 y
---
 Щ
(1 row)

You can easily wrap that up in a simple SQL function:

CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r')
$$ LANGUAGE 'SQL' IMMUTABLE;

so you can just:

test=> select fixstr(x) from ss;
 fixstr
--------
 Щ
(1 row)

--
Craig Ringer