8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Sergey Burladyan

Hi, all!

I can't convert with convert(bytea, name, name)::bytea from 'iso-8859-5'
to 'windows-1251' or any other Cyrillic 8-bit encoding.

seb=> show client_encoding ;
 client_encoding
-----------------
 UTF8

seb=> show server_encoding;
 server_encoding
-----------------
 UTF8

seb=> select version();
                                        version
----------------------------------------------------------------------------------------
 PostgreSQL 8.3.0 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.2.3 (Debian 4.2.3-1)

 lc_collate                      | ru_RU.UTF-8
 lc_ctype                        | ru_RU.UTF-8
 lc_messages                     | ru_RU.UTF-8
 lc_monetary                     | ru_RU.UTF-8
 lc_numeric                      | ru_RU.UTF-8
 lc_time                         | ru_RU.UTF-8

seb=> select
convert(convert('абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ', 'utf-8', 'iso-8859-5'), 'iso-8859-5', 'windows-1251');
ERROR:  character 0xf1 of encoding "ISO_8859_5" has no equivalent in "MULE_INTERNAL"

First, I convert from my console locale encoding (ru_RU.UTF-8) to iso-8859-5
(a Cyrillic 8-bit character encoding); the second convert is there to show the problem.

windows-1251 is another Cyrillic 8-bit character encoding; converting to koi8-r
does not work either.

I wrote the output of convert(..., 'utf-8', 'iso-8859-5') to a file and read it
with iconv -f iso-8859-5: all characters were read correctly (see the attached programs).
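
The iconv check described above can be approximated outside the database. What follows is a minimal Python sketch, not part of the original report: it assumes Python's standard iso8859_5 and cp1251 codecs, and the names ALPHABET and reencode are made up for illustration. It shows which letter sits at byte 0xf1 in ISO-8859-5 and that the alphabet survives a conversion that pivots through Unicode.

```python
# Illustrative check using Python's standard codecs (an assumption of this
# sketch; the original report used iconv and PostgreSQL's convert()).
ALPHABET = (
    "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
    "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
)


def reencode(data: bytes, source: str, target: str) -> bytes:
    """Convert 8-bit text between encodings, using Unicode as the pivot."""
    return data.decode(source).encode(target)


iso_bytes = ALPHABET.encode("iso8859_5")

# Show the character stored at the byte named in the error message.
print(repr(bytes([0xF1]).decode("iso8859_5")))

# Converting via Unicode should keep every letter of the alphabet.
cp1251_bytes = reencode(iso_bytes, "iso8859_5", "cp1251")
print(cp1251_bytes.decode("cp1251") == ALPHABET)
```

Since the bytes are valid ISO-8859-5, this supports the report's reading that the failure lies in the conversion path rather than in the input data.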

convert(..., 'iso-8859-5', 'utf-8') looks good; I checked it like this:
seb=> set standard_conforming_strings TO on; --- do not escape bytea
SET
seb=> select
convert('\320\321\322\323\324\325\361\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\260\261\262\263\264\265\241\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317', 'iso-8859-5', 'utf-8');
 convert
---------
 \320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257
(1 запись)

seb=> set standard_conforming_strings TO off; --- now we must escape bytea to show the text
SET
seb=> select
E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
                              ?column?
--------------------------------------------------------------------
 абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
(1 запись)

It is OK.

The text string parameter is the Russian alphabet from the first letter to the
last in lower case, followed by the same letters in UPPER case.

Maybe I am doing something wrong?

---

Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Sergey Burladyan

Hi, all!

I found the problem.

src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
does not have the Cyrillic letter 'IO' in the ISO-8859-5 to MULE internal code
translation table (function iso2mic(const unsigned char *l, unsigned char *p,
int len)). This is a bug, because the letter is widely used (it is a basic letter,
like A, B or C in English :)) and it exists in all Russian Cyrillic encodings
(koi8-r, iso-8859-5, windows-1251, cp866).
For example, the Russian words for 'all', 'hedgehog', 'Christmas tree' and many
others must be written with it.

Here is a patch that adds it to the ISO-8859-5 to MULE internal code translation
table. I don't know whether this is OK, or whether it breaks any internal rule
or code.

By the way, as I understand it, you are using the koi8-r encoding as the internal
representation of Cyrillic charsets. This has another problem: the second
"widely" used character is <U2116> NUMERO SIGN (many accountants and
managers use it :) in the Cyrillic Windows world), and it exists in the
windows-1251, cp866 and iso-8859-5 encodings, but not in koi8-r...
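
The claim about which encodings carry YO and NUMERO SIGN can be checked against independent tables. The sketch below is an illustration added for this write-up, not part of the thread: it assumes Python's standard codecs are a fair stand-in for the four encodings (cp866 here corresponds to what the thread calls cp866 or windows-866), and byte_for is a hypothetical helper name.

```python
# Look up where "ё", "Ё" and "№" live in each 8-bit Cyrillic encoding,
# using Python's codecs as the reference tables (an assumption of this sketch).
ENCODINGS = ["koi8_r", "iso8859_5", "cp1251", "cp866"]


def byte_for(char, encoding):
    """Return the single byte for `char` in `encoding`, or None if it is absent."""
    try:
        return char.encode(encoding)[0]
    except UnicodeEncodeError:
        return None


for name in ENCODINGS:
    row = {}
    for char in "ёЁ№":
        code = byte_for(char, name)
        row[char] = None if code is None else hex(code)
    print(name, row)
```

A None entry means the character cannot be represented in that encoding, so any conversion that pivots through that encoding has to fail or lose the character.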

---

Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Heikki Linnakangas

Sergey Burladyan wrote:
> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
> does not have cyrillic letter 'IO' in ISO-8859-5 to mule internal code
> translation table (function iso2mic(const unsigned char *l, unsigned char *p,
> int len)). this is bug, because it is widely used and it is main letter like
> A, B or C in english :) and it is exist in all russian cyrillic's encoding
> (koi8-r, iso-8859-5, windows-1251, cp866).
> for example, in russian, words 'all', 'hedgehog', 'Christmas-tree' and many
> other must be written with it.
>
> here is the patch for add it to ISO-8859-5 to mule internal code translation
> table. i am don't know is this ok and do not brake any internal rule or
> code ?

You'd need to modify the mic->ISO-8859-5 translation table as well, for
converting in the other direction.

> By the way, as i can understand you are using koi8-r encoding for internal
> representation of cyrillic charsets - this is have also another problem. the
> second "widely" used char is <U2116> NUMERO SIGN (many accountants and
> managers use it :) in cyrillic windows world) and it is exist in
> windows-1251, cp866 and iso-8859-5 encoding, but not in koi8-r...

Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
intermediate encoding, because there's no direct conversion table
between ISO-8859-5 and the other Cyrillic encodings. Ideally there would
be. Another possibility would be to use UTF-8 as the intermediate
encoding; that'd probably be much slower, but UTF-8 should have all the
characters needed.

Are there any other characters like "YO" that are missing, that exist in
all the encodings? Looking at the character set table for KOI8-R, it
looks like "YO" is in an odd place in the table, compared to all the
other Cyrillic characters. Perhaps that's why it was missed.
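
To make the intermediate-encoding idea concrete, here is a small Python model of a table-driven conversion in which a zero entry means "no mapping". It is only a sketch under stated assumptions: it is not PostgreSQL's code, build_table and convert are hypothetical names, the tables are derived from Python's codecs, and the skip parameter exists solely to imitate a table that is missing YO.

```python
# Hypothetical model of a table-driven conversion through an intermediate
# 8-bit encoding. A 0x00 table entry means "no mapping", as in the diff's tables.

def build_table(src, dst, skip=""):
    """Map each high byte (0x80-0xFF) of `src` to its byte in `dst`."""
    table = [0x00] * 128
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(src)
            if char in skip:
                continue  # simulate an entry that was left out of the table
            table[byte - 0x80] = char.encode(dst)[0]
        except UnicodeError:
            pass  # byte undefined in src, or character absent from dst
    return table


def convert(data, table, src, dst):
    """Translate `data` byte by byte; ASCII passes through unchanged."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)
            continue
        mapped = table[byte - 0x80]
        if mapped == 0x00:
            raise ValueError(
                f'character 0x{byte:02x} of encoding "{src}" '
                f'has no equivalent in "{dst}"'
            )
        out.append(mapped)
    return bytes(out)


iso_text = "ёж".encode("iso8859_5")
buggy = build_table("iso8859_5", "koi8_r", skip="ёЁ")
fixed = build_table("iso8859_5", "koi8_r")
koi_to_win = build_table("koi8_r", "cp1251")

try:
    convert(iso_text, buggy, "iso8859_5", "koi8_r")
except ValueError as exc:
    print("buggy table:", exc)

# Two hops: ISO-8859-5 -> KOI8-R (intermediate) -> windows-1251.
via_koi = convert(iso_text, fixed, "iso8859_5", "koi8_r")
print(convert(via_koi, koi_to_win, "koi8_r", "cp1251") == "ёж".encode("cp1251"))
```

The model also shows the limit of this design: any character absent from the intermediate encoding cannot be converted, however complete the source and target tables are.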

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Heikki Linnakangas

Heikki Linnakangas wrote:
> Sergey Burladyan wrote:
>> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
>> does not have cyrillic letter 'IO' in ISO-8859-5 to mule internal code
>> translation table (function iso2mic(const unsigned char *l, unsigned
>> char *p, int len)). this is bug, because it is widely used and it is
>> main letter like A, B or C in english :) and it is exist in all
>> russian cyrillic's encoding (koi8-r, iso-8859-5, windows-1251, cp866).
>> for example, in russian, words 'all', 'hedgehog', 'Christmas-tree' and
>> many other must be written with it.
>>
>> here is the patch for add it to ISO-8859-5 to mule internal code
>> translation table. i am don't know is this ok and do not brake any
>> internal rule or code ?
>
> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Here's a patch that does the conversion in the other direction as well.
As I'm not too familiar with cyrillic, can you double-check that this
works? I tested it using the convert() function between different
encodings, and it seems ok to me.

>> By the way, as i can understand you are using koi8-r encoding for
>> internal representation of cyrillic charsets - this is have also
>> another problem. the second "widely" used char is <U2116> NUMERO SIGN
>> (many accountants and managers use it :) in cyrillic windows world)
>> and it is exist in windows-1251, cp866 and iso-8859-5 encoding, but
>> not in koi8-r...
>
> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R ) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
>
> Is there any other characters like "YO" that are missing, that exist in
> all the encodings? Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c,v
retrieving revision 1.16
diff -c -r1.16 cyrillic_and_mic.c
*** src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c    1 Jan 2008 19:45:53 -0000    1.16
--- src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c    19 Mar 2008 21:04:40 -0000
***************
*** 483,489 ****
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
          0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
--- 483,489 ----
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0xb3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
          0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
***************
*** 493,499 ****
          0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
          0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
          0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
      };

--- 493,499 ----
          0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
          0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
          0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
!         0x00, 0xa3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
      };

***************
*** 509,517 ****
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
          0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,
--- 509,517 ----
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+         0x00, 0x00, 0x00, 0xf1, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0x00, 0xa1, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
          0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
          0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,
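
The four table entries added by the patch can be cross-checked against independent mapping tables. The Python sketch below is an editorial illustration, not part of the patch. It assumes that each table row covers eight consecutive byte values starting at 0x80, so the changed cells correspond to ISO-8859-5 bytes 0xa1 and 0xf1 and KOI8-R bytes 0xa3 and 0xb3; that reading of the diff is an interpretation. Python's codecs serve as the reference, and agrees_with_codecs is a hypothetical helper.

```python
# Byte pairs read from the patched rows of the diff (the row/column
# interpretation is an assumption of this sketch).
ISO_TO_KOI = {0xA1: 0xB3, 0xF1: 0xA3}  # ISO-8859-5 byte -> KOI8-R byte
KOI_TO_ISO = {0xA3: 0xF1, 0xB3: 0xA1}  # KOI8-R byte -> ISO-8859-5 byte


def agrees_with_codecs(mapping, source, target):
    """True if each source byte and its mapped target byte decode to the same character."""
    return all(
        bytes([src]).decode(source) == bytes([dst]).decode(target)
        for src, dst in mapping.items()
    )


print(agrees_with_codecs(ISO_TO_KOI, "iso8859_5", "koi8_r"))
print(agrees_with_codecs(KOI_TO_ISO, "koi8_r", "iso8859_5"))

# Show which characters the patched ISO-8859-5 bytes stand for.
print({hex(b): bytes([b]).decode("iso8859_5") for b in ISO_TO_KOI})
```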

Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Sergey Burladyan

Thursday 20 March 2008 01:16:34 Heikki Linnakangas:

Thanks for the answer, Heikki!

> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.
Oops, I had not thought about that %)

> Here's a patch that does the conversion in the other direction as well.
> As I'm not too familiar with cyrillic, can you double-check that this
> works? I tested it using the convert() function between different
> encodings, and it seems ok to me.

Yes, I tested it with a function like this, and it works now :)

create or replace function test_convert() returns setof record as $$
declare
  --- russian alphabet, 33 upper and 33 lower letters in utf-8 encoding
  r bytea default E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
  s bytea; --- converted to result
  t bytea; --- converted back result
  res record;
begin
  raise notice 'russian ABC: "%"', encode(r, 'escape');
  s := convert(r, 'utf-8', 'iso-8859-5');

  t := convert(s, 'iso-8859-5', 'windows-1251');
  t := convert(t, 'windows-1251', 'utf-8');
  if t != r then
     raise exception 'iso-8859-5, windows-1251 | t != r';
  end if;
  res := row('iso-8859-5, windows-1251'::text, encode(
      convert(convert(s, 'iso-8859-5', 'windows-1251'), 'windows-1251', 'utf-8')
      , 'escape')::text
  );
  return next res;
[...skip...]

seb=# select * from test_convert() as (conv text, res text);
NOTICE:  russian ABC: "абвгдеёжз..."
            conv            |    res
----------------------------+-----------
 iso-8859-5, windows-1251   | абвгдеёжз...
 iso-8859-5, windows-866    | абвгдеёжз...
 iso-8859-5, koi8-r         | абвгдеёжз...
 iso-8859-5, iso-8859-5     | абвгдеёжз...
 windows-866, windows-1251  | абвгдеёжз...
 windows-866, iso-8859-5    | абвгдеёжз...
 windows-866, koi8-r        | абвгдеёжз...
 windows-866, windows-866   | абвгдеёжз...
 windows-1251, windows-866  | абвгдеёжз...
 windows-1251, iso-8859-5   | абвгдеёжз...
 windows-1251, koi8-r       | абвгдеёжз...
 windows-1251, windows-1251 | абвгдеёжз...
 koi8-r, windows-866        | абвгдеёжз...
 koi8-r, iso-8859-5         | абвгдеёжз...
 koi8-r, windows-1251       | абвгдеёжз...
 koi8-r, koi8-r             | абвгдеёжз...
(16 rows)
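
The same 16-pair round trip can be written outside the database as a cross-check. This is a hedged Python analogue of test_convert(), not a port of it: it relies on Python's codecs rather than PostgreSQL's conversion procedures, maps windows-866 to Python's cp866, and round_trip is an illustrative name.

```python
from itertools import product

# Russian alphabet, 33 lower-case and 33 upper-case letters.
ALPHABET = (
    "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
    "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
)
ENCODINGS = ["iso8859_5", "cp866", "cp1251", "koi8_r"]


def round_trip(text, source, target):
    """Encode `text` in `source`, convert it to `target`, and decode it back."""
    in_source = text.encode(source)
    in_target = in_source.decode(source).encode(target)
    return in_target.decode(target)


# One entry per (source, target) pair, including the identity pairs.
results = {
    (src, dst): round_trip(ALPHABET, src, dst) == ALPHABET
    for src, dst in product(ENCODINGS, repeat=2)
}

for (src, dst), ok in results.items():
    print(f"{src}, {dst}: {'ok' if ok else 'MISMATCH'}")
```

Because this version pivots through Unicode, it exercises the character repertoires only; it says nothing about the MULE_INTERNAL tables that the patch fixes.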

> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R ) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
I think UTF-8 is too complex for translating one 8-bit charset to another 8-bit
charset, but the other solution is many, many translation tables... hard question %)

> Is there any other characters like "YO" that are missing, that exist in
> all the encodings?
If we are talking about alphabet letters, the answer is no: only "YO" was missing.
If we are talking about any character, there is 'NO-BREAK SPACE' (U+00A0); it exists
in 1251, 866, koi8-r and iso, but I do not think it is widely used...
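
The question of which characters exist in all four encodings can be explored by intersecting their repertoires. The following Python sketch is an illustration only: it takes Python's codec tables as the reference (an assumption, since PostgreSQL ships its own tables), looks only at bytes 0x80 to 0xff, and repertoire is a hypothetical helper name.

```python
# Intersect the non-ASCII repertoires of the four 8-bit Cyrillic encodings,
# using Python's codecs as the reference tables (an assumption of this sketch).
ENCODINGS = ["iso8859_5", "koi8_r", "cp1251", "cp866"]


def repertoire(encoding):
    """Return the set of characters encoded by bytes 0x80-0xFF in `encoding`."""
    chars = set()
    for byte in range(0x80, 0x100):
        try:
            chars.add(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this encoding
    return chars


common = set.intersection(*(repertoire(name) for name in ENCODINGS))

# List the shared characters by code point for inspection.
print(sorted(f"U+{ord(c):04X}" for c in common))
print("NO-BREAK SPACE shared:", "\u00a0" in common)
print("NUMERO SIGN shared:", "\u2116" in common)
```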

> Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.
Yes, I understand. Russian character sets have always been a challenge for all
programmers :) There are at least five of them, and they are all different.

Thanks for the patch, Heikki!

---

Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Heikki Linnakangas

Sergey Burladyan wrote:
> Thursday 20 March 2008 01:16:34 Heikki Linnakangas:
>> Here's a patch that does the conversion in the other direction as well.
>> As I'm not too familiar with cyrillic, can you double-check that this
>> works? I tested it using the convert() function between different
>> encodings, and it seems ok to me.
>
> yes, i test it with function like this and it work now :)

Ok, patch applied.

>> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R ) as an
>> intermediate encoding, because there's no direct conversion table
>> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
>> be. Another possibility would be to use UTF-8 as the intermediate
>> encoding; that'd probably be much slower, but UTF-8 should have all the
>> characters needed.
> I think that UTF-8 is too complex for translate 8-bit charset to another 8-bit
> charset, but other solution is many many translate tables... hard question %)

Yeah. It's probably not worth the effort to change and test it. Apparently
there aren't many people using these conversion functions, as the bug has
been there since 7.3 and you're the first one to notice.

>> Is there any other characters like "YO" that are missing, that exist in
>> all the encodings?
> if we say about alphabet letters, the answer is - No, only "YO" was missing.
> if we say about any character, there is 'NO-BREAK SPACE' (U+00A0) it exist in
> 1251, 866, koi8-r and iso but i do not think that it widely used...

Ok, good.

Thanks for the report and the patch!

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com