Discussion: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
Sergey Burladyan
Date:
Hi, all!

I can't convert with convert(bytea, name, name)::bytea from 'iso-8859-5' to 'windows-1251' or any other cyrillic 8-bit encoding.

seb=> show client_encoding;
 client_encoding
-----------------
 UTF8

seb=> show server_encoding;
 server_encoding
-----------------
 UTF8

seb=> select version();
                                         version
----------------------------------------------------------------------------------------
 PostgreSQL 8.3.0 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.2.3 (Debian 4.2.3-1)

 lc_collate  | ru_RU.UTF-8
 lc_ctype    | ru_RU.UTF-8
 lc_messages | ru_RU.UTF-8
 lc_monetary | ru_RU.UTF-8
 lc_numeric  | ru_RU.UTF-8
 lc_time     | ru_RU.UTF-8

seb=> select convert(convert('абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ', 'utf-8', 'iso-8859-5'), 'iso-8859-5', 'windows-1251');
ERROR:  character 0xf1 of encoding "ISO_8859_5" has no equivalent in "MULE_INTERNAL"

The first convert takes the text from my console locale encoding (ru_RU.UTF-8) to iso-8859-5 (a cyrillic 8-bit character encoding); the second convert is there to show the problem. windows-1251 is another cyrillic 8-bit character encoding; converting to koi8-r does not work either.

I wrote the output of convert(..., 'utf-8', 'iso-8859-5') to a file and read it with iconv -f iso-8859-5: all characters were read correctly (see the programs in the attachment).

convert(..., 'iso-8859-5', 'utf-8') looks good. I checked it like this:

seb=> set standard_conforming_strings TO on; --- do not escape bytea
SET
seb=> select convert('\320\321\322\323\324\325\361\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\260\261\262\263\264\265\241\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317', 'iso-8859-5', 'utf-8');
 convert
---------
 \320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257
(1 row)

seb=> set standard_conforming_strings TO off; --- now we must escape bytea to show the text
SET
seb=> select E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
                              ?column?
--------------------------------------------------------------------
 абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
(1 row)

It is ok.

The text string parameter is the Russian alphabet from the first letter to the last in lower case, and then from the first letter to the last in UPPER case.

Maybe I am doing something wrong?
---
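For comparison, the same chain of conversions can be sketched outside the database. This is only an illustration using Python's standard codecs, not PostgreSQL's conversion procedures; the helper name convert and the codec aliases (iso8859_5, cp1251) are choices made for this sketch. It shows that the byte 0xf1 named in the error message is the lower-case letter 'ё' in ISO-8859-5, and that the round trip is well defined when a complete mapping is available.

```python
# Russian alphabet: 33 lower-case letters followed by 33 upper-case letters.
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ALPHABET = LOWER + LOWER.upper()


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from src to dst (loosely mirrors convert(bytea, name, name))."""
    return data.decode(src).encode(dst)


utf8 = ALPHABET.encode("utf-8")
iso = convert(utf8, "utf-8", "iso8859_5")   # first convert from the report
win = convert(iso, "iso8859_5", "cp1251")   # the step that failed in 8.3.0
back = convert(win, "cp1251", "utf-8")

# 'ё' is the 7th letter, so it sits at index 6 of the ISO-8859-5 byte string.
print(hex(iso[6]))   # byte value of 'ё' in ISO-8859-5
print(back == utf8)  # whether the round trip preserved the text
```

Python converts through Unicode internally, so this sketch exercises a different path from the MULE_INTERNAL one named in the error.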
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
Sergey Burladyan
Date:
Hi, all!

I found the problem. src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c does not have the cyrillic letter 'IO' in the ISO-8859-5 to mule internal code translation table (function iso2mic(const unsigned char *l, unsigned char *p, int len)). This is a bug, because the letter is widely used: it is a main letter like A, B or C in English :) and it exists in all Russian cyrillic encodings (koi8-r, iso-8859-5, windows-1251, cp866). For example, in Russian the words for 'all', 'hedgehog', 'Christmas tree' and many others must be written with it.

Here is a patch that adds it to the ISO-8859-5 to mule internal code translation table. I don't know whether this is ok, or whether it breaks any internal rule or code?

By the way, as far as I understand, you are using the koi8-r encoding as the internal representation for cyrillic charsets, and this has another problem. The second "widely" used character is <U2116> NUMERO SIGN (many accountants and managers use it :) in the cyrillic windows world); it exists in the windows-1251, cp866 and iso-8859-5 encodings, but not in koi8-r...
---
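The failure mode described here can be sketched as a per-byte table lookup in which an empty slot means "no equivalent". This is a hypothetical model, not the code in cyrillic_and_mic.c: the names build_table, translate and NO_EQUIVALENT are invented for the illustration, the 0x00 marker is an assumption, and the tables are derived from Python's codecs rather than copied from PostgreSQL.

```python
# Hypothetical model of a 128-entry translation table for bytes 0x80..0xFF.
NO_EQUIVALENT = 0x00  # assumed marker for "no mapping in the target encoding"


def build_table(letters: str, src: str, dst: str) -> list:
    """Fill table slots only for the given letters; every other slot stays empty."""
    table = [NO_EQUIVALENT] * 128
    for ch in letters:
        table[ch.encode(src)[0] - 0x80] = ch.encode(dst)[0]
    return table


def translate(data: bytes, table: list, src_name: str) -> bytes:
    """Translate byte by byte; ASCII passes through, high bytes go via the table."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)
            continue
        mapped = table[byte - 0x80]
        if mapped == NO_EQUIVALENT:
            raise ValueError(
                f'character 0x{byte:02x} of encoding "{src_name}" has no equivalent'
            )
        out.append(mapped)
    return bytes(out)


# 32 letters: the alphabet with 'ё' left out, as in the table before the fix.
WITHOUT_IO = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
WITHOUT_IO += WITHOUT_IO.upper()
WITH_IO = WITHOUT_IO + "ёЁ"

buggy_table = build_table(WITHOUT_IO, "iso8859_5", "koi8_r")
fixed_table = build_table(WITH_IO, "iso8859_5", "koi8_r")
```

With buggy_table, translating the ISO-8859-5 bytes of "ёж" raises an error naming 0xf1, much like the message in the report; with fixed_table the same input translates cleanly.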
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
"Heikki Linnakangas"
Date:
Sergey Burladyan wrote:
> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
> does not have the cyrillic letter 'IO' in the ISO-8859-5 to mule internal code
> translation table (function iso2mic(const unsigned char *l, unsigned char *p,
> int len)). This is a bug, because the letter is widely used: it is a main letter
> like A, B or C in English :) and it exists in all Russian cyrillic encodings
> (koi8-r, iso-8859-5, windows-1251, cp866).
> For example, in Russian the words for 'all', 'hedgehog', 'Christmas tree' and
> many others must be written with it.
>
> Here is a patch that adds it to the ISO-8859-5 to mule internal code translation
> table. I don't know whether this is ok, or whether it breaks any internal rule or
> code?

You'd need to modify the mic->ISO-8859-5 translation table as well, for converting in the other direction.

> By the way, as far as I understand, you are using the koi8-r encoding as the
> internal representation for cyrillic charsets, and this has another problem. The
> second "widely" used character is <U2116> NUMERO SIGN (many accountants and
> managers use it :) in the cyrillic windows world); it exists in the
> windows-1251, cp866 and iso-8859-5 encodings, but not in koi8-r...

Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an intermediate encoding, because there's no direct conversion table between ISO-8859-5 and the other cyrillic encodings. Ideally there would be. Another possibility would be to use UTF-8 as the intermediate encoding; that'd probably be much slower, but UTF-8 should have all the characters needed.

Are there any other characters like "YO" that are missing, that exist in all the encodings? Looking at the character set table for KOI8-R, it looks like the "YO" is in an odd place in the table, compared to all other cyrillic characters. Perhaps that's why it was missed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
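The trade-off between the two intermediate encodings can be illustrated with NUMERO SIGN (U+2116). The sketch below uses Python's codecs as a stand-in for the conversion tables; convert_via and encodable are hypothetical helpers written for this example, and nothing here reflects how PostgreSQL implements its conversions.

```python
ENCODINGS = ["koi8_r", "iso8859_5", "cp1251", "cp866"]
NUMERO_SIGN = "\u2116"


def encodable(ch: str, encoding: str) -> bool:
    """Return True if the character has a code point in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def convert_via(data: bytes, src: str, pivot: str, dst: str) -> bytes:
    """Convert src -> pivot -> dst; fails if the pivot cannot hold a character."""
    pivot_bytes = data.decode(src).encode(pivot)  # may raise UnicodeEncodeError
    return pivot_bytes.decode(pivot).encode(dst)


# Which of the four cyrillic 8-bit encodings contain NUMERO SIGN?
support = {enc: encodable(NUMERO_SIGN, enc) for enc in ENCODINGS}

iso_numero = NUMERO_SIGN.encode("iso8859_5")
via_utf8 = convert_via(iso_numero, "iso8859_5", "utf-8", "cp1251")
```

Calling convert_via(iso_numero, "iso8859_5", "koi8_r", "cp1251") raises UnicodeEncodeError, because the character is lost at the intermediate step even though both endpoints can represent it.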
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> Sergey Burladyan wrote:
>> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
>> does not have the cyrillic letter 'IO' in the ISO-8859-5 to mule internal code
>> translation table (function iso2mic(const unsigned char *l, unsigned
>> char *p, int len)). This is a bug, because the letter is widely used: it is a
>> main letter like A, B or C in English :) and it exists in all
>> Russian cyrillic encodings (koi8-r, iso-8859-5, windows-1251, cp866).
>> For example, in Russian the words for 'all', 'hedgehog', 'Christmas tree' and
>> many others must be written with it.
>>
>> Here is a patch that adds it to the ISO-8859-5 to mule internal code
>> translation table. I don't know whether this is ok, or whether it breaks any
>> internal rule or code?
>
> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Here's a patch that does the conversion in the other direction as well. As I'm not too familiar with cyrillic, can you double-check that this works? I tested it using the convert() function between different encodings, and it seems ok to me.

>> By the way, as far as I understand, you are using the koi8-r encoding as the
>> internal representation for cyrillic charsets, and this has another
>> problem. The second "widely" used character is <U2116> NUMERO SIGN
>> (many accountants and managers use it :) in the cyrillic windows world);
>> it exists in the windows-1251, cp866 and iso-8859-5 encodings, but
>> not in koi8-r...
>
> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
>
> Are there any other characters like "YO" that are missing, that exist in
> all the encodings? Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Index: src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c,v
retrieving revision 1.16
diff -c -r1.16 cyrillic_and_mic.c
*** src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c 1 Jan 2008 19:45:53 -0000 1.16
--- src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c 19 Mar 2008 21:04:40 -0000
***************
*** 483,489 ****
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
      0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
--- 483,489 ----
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!     0x00, 0xb3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
      0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
***************
*** 493,499 ****
      0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
      0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
      0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
!     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
  };
--- 493,499 ----
      0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
      0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
      0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
!     0x00, 0xa3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
  };
***************
*** 509,517 ****
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
      0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,
--- 509,517 ----
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+     0x00, 0x00, 0x00, 0xf1, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!     0x00, 0x00, 0x00, 0xa1, 0x00, 0x00, 0x00, 0x00,
      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
      0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,
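The byte values added by the patch can be cross-checked against an independent source. The sketch below assumes each table covers bytes 0x80..0xFF in rows of eight, so the changed entries correspond to ISO-8859-5 bytes 0xA1 and 0xF1 and to KOI8-R bytes 0xA3 and 0xB3; that reading is inferred from the neighbouring context rows of the diff, not stated in the thread. Python's codecs serve as the reference, and agrees is a hypothetical helper.

```python
# Byte pairs read from the patch (source byte -> target byte); the positions
# are an interpretation of the diff layout, labelled as an assumption above.
PATCH_ISO_TO_KOI = {0xA1: 0xB3, 0xF1: 0xA3}
PATCH_KOI_TO_ISO = {0xA3: 0xF1, 0xB3: 0xA1}


def agrees(mapping: dict, src: str, dst: str) -> bool:
    """True if every byte pair decodes to the same character in both encodings."""
    return all(
        bytes([s]).decode(src) == bytes([d]).decode(dst)
        for s, d in mapping.items()
    )


forward_ok = agrees(PATCH_ISO_TO_KOI, "iso8859_5", "koi8_r")
reverse_ok = agrees(PATCH_KOI_TO_ISO, "koi8_r", "iso8859_5")

# The characters the new table entries stand for.
upper_io = bytes([0xA1]).decode("iso8859_5")
lower_io = bytes([0xF1]).decode("iso8859_5")
```

Under that reading, both directions map the upper-case and lower-case forms of the letter to each other consistently.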
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
Sergey Burladyan
Date:
Thursday 20 March 2008 01:16:34 Heikki Linnakangas:

Thanks for the answer, Heikki!

> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Oops, I had not thought about that %)

> Here's a patch that does the conversion in the other direction as well.
> As I'm not too familiar with cyrillic, can you double-check that this
> works? I tested it using the convert() function between different
> encodings, and it seems ok to me.

Yes, I tested it with a function like this and it works now :)

create or replace function test_convert() returns setof record as $$
declare
    --- russian alphabet, 33 upper and 33 lower letters in utf-8 encoding
    r bytea default E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
    s bytea; --- converted to result
    t bytea; --- converted back result
    res record;
begin
    raise notice 'russian ABC: "%"', encode(r, 'escape');

    s := convert(r, 'utf-8', 'iso-8859-5');
    t := convert(s, 'iso-8859-5', 'windows-1251');
    t := convert(t, 'windows-1251', 'utf-8');
    if t != r then
        raise exception 'iso-8859-5, windows-1251 | t != r';
    end if;
    res := row('iso-8859-5, windows-1251'::text,
               encode(convert(convert(s, 'iso-8859-5', 'windows-1251'), 'windows-1251', 'utf-8'), 'escape')::text);
    return next res;
    [...skip...]

seb=# select * from test_convert() as (conv text, res text);
NOTICE:  russian ABC: "абвгдеёжз..."
            conv            |     res
----------------------------+--------------
 iso-8859-5, windows-1251   | абвгдеёжз...
 iso-8859-5, windows-866    | абвгдеёжз...
 iso-8859-5, koi8-r         | абвгдеёжз...
 iso-8859-5, iso-8859-5     | абвгдеёжз...
 windows-866, windows-1251  | абвгдеёжз...
 windows-866, iso-8859-5    | абвгдеёжз...
 windows-866, koi8-r        | абвгдеёжз...
 windows-866, windows-866   | абвгдеёжз...
 windows-1251, windows-866  | абвгдеёжз...
 windows-1251, iso-8859-5   | абвгдеёжз...
 windows-1251, koi8-r       | абвгдеёжз...
 windows-1251, windows-1251 | абвгдеёжз...
 koi8-r, windows-866        | абвгдеёжз...
 koi8-r, iso-8859-5         | абвгдеёжз...
 koi8-r, windows-1251       | абвгдеёжз...
 koi8-r, koi8-r             | абвгдеёжз...
(16 rows)

> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.

I think UTF-8 is too complex for translating one 8-bit charset to another 8-bit charset, but the other solution is many, many translation tables... hard question %)

> Are there any other characters like "YO" that are missing, that exist in
> all the encodings?

If we are talking about letters of the alphabet, the answer is no, only "YO" was missing. If we are talking about any character, there is 'NO-BREAK SPACE' (U+00A0); it exists in 1251, 866, koi8-r and iso, but I do not think it is widely used...

> Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.

Yes, I understand. Russian character sets have always been a challenge for all programmers :) There are at least five of them, and they are all different.

Thanks for the patch, Heikki!
---
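The pairwise check performed by test_convert() can be sketched outside the database as well. The version below is an approximation in Python, not a port of the elided PL/pgSQL function: it runs the alphabet through all 16 source/target pairs of the four encodings using Python's codecs (cp866 and cp1251 standing in for windows-866 and windows-1251), and also looks up the NO-BREAK SPACE byte in each encoding. The name round_trip_pairs is made up for this example.

```python
from itertools import product

ENCODINGS = ["iso8859_5", "cp866", "cp1251", "koi8_r"]

LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ALPHABET = LOWER + LOWER.upper()


def round_trip_pairs(text: str) -> dict:
    """For every (src, dst) pair, re-encode src -> dst and compare with the input."""
    results = {}
    for src, dst in product(ENCODINGS, repeat=2):
        s = text.encode(src)            # text in the source encoding
        t = s.decode(src).encode(dst)   # converted to the target encoding
        results[(src, dst)] = t.decode(dst) == text
    return results


pairs = round_trip_pairs(ALPHABET)

# NO-BREAK SPACE (U+00A0) as a single byte in each of the four encodings.
nbsp_bytes = {enc: "\u00a0".encode(enc) for enc in ENCODINGS}
```

Every pair reports True here, and NO-BREAK SPACE is encodable in all four encodings, which matches the observation in the message above.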
Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
From
"Heikki Linnakangas"
Date:
Sergey Burladyan wrote:
> Thursday 20 March 2008 01:16:34 Heikki Linnakangas:
>> Here's a patch that does the conversion in the other direction as well.
>> As I'm not too familiar with cyrillic, can you double-check that this
>> works? I tested it using the convert() function between different
>> encodings, and it seems ok to me.
>
> Yes, I tested it with a function like this and it works now :)

Ok, patch applied.

>> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
>> intermediate encoding, because there's no direct conversion table
>> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
>> be. Another possibility would be to use UTF-8 as the intermediate
>> encoding; that'd probably be much slower, but UTF-8 should have all the
>> characters needed.
> I think UTF-8 is too complex for translating one 8-bit charset to another 8-bit
> charset, but the other solution is many, many translation tables... hard question %)

Yeah. It's probably not worth the effort to change/test it. Apparently there aren't many people using these conversion functions, as the bug has been there since 7.3 and you're the first one to notice.

>> Are there any other characters like "YO" that are missing, that exist in
>> all the encodings?
> If we are talking about letters of the alphabet, the answer is no, only "YO" was missing.
> If we are talking about any character, there is 'NO-BREAK SPACE' (U+00A0); it exists in
> 1251, 866, koi8-r and iso, but I do not think it is widely used...

Ok, good. Thanks for the report and the patch!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com