Discussion: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
Hi,

I was wondering whether this is considered a bug and, if so, what the plans are to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php

I searched the pgsql-bugs archive and the wiki to-do list and found nothing, but I could have missed it.

Sincerely,
Kasia
> Hi,
> I was wondering if this was considered a bug, and if so what were the plans to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php
>
> I searched the: pgsql-bug archive and found nothing
> I also searched the wiki to do list and found nothing
> But I could have missed it.

I don't consider it a bug. We map "WAVE DASH" of EUC-JP (0xa1c1) to U+FF5E, not U+301C. U+FF5E and U+301C look the same, but they are different code points for some reason I don't know. EUC-JP, on the other hand, has only one code point for WAVE DASH. So if we want round-trip conversion between EUC-JP and UTF-8, we have to choose either U+FF5E or U+301C. We have chosen U+FF5E. If we changed the mapping, many existing applications would break. The same can be said of the MINUS sign.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
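To make the two code points concrete, here is a small Python sketch (not part of the original thread) that decodes the byte sequence from the error message and compares it with the code point the reply above says is paired with EUC-JP 0xa1c1. It uses only the standard library; the variable names are illustrative.

```python
import unicodedata

# Byte sequence quoted in the error message, read as UTF-8.
rejected = bytes.fromhex("e3809c").decode("utf-8")
# Code point that, per the reply above, is mapped to EUC-JP 0xa1c1.
accepted = "\uff5e"

for ch in (rejected, accepted):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} utf8={ch.encode('utf-8').hex()}")
# U+301C WAVE DASH utf8=e3809c
# U+FF5E FULLWIDTH TILDE utf8=efbd9e
```

So the character named in the error is U+301C, which looks like U+FF5E but is a separate code point with its own UTF-8 encoding.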
Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 08:05, Kasia Tuszynska <ktuszynska@esri.com> wrote:
> I was wondering if this was considered a bug, and if so what were the plans
> to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php

The wave dash issue is not postgres-specific; some other converters just replace the character with '?'. Postgres throws an error instead. I guess there is no possibility of supporting ambiguous character mappings in the default conversions, but you can define more relaxed conversion procedures for your purpose.

BTW, we cannot use non-default conversion procedures from SQL commands, right? If it were allowed, we could use some "relaxed" conversions for the initial loading, like this:

  =# SET character_conversion TO utf8_to_eucjp_relaxed;
  =# COPY tbl FROM '/file_with_wave_dashes.utf8.tsv';
  =# RESET character_conversion;

Another idea is to allow creating new encoding names and defining the above conversion procs as the default:

  =# CREATE ENCODING eucjp_relaxed;
  =# CREATE DEFAULT CONVERSION xxx FOR utf8 TO eucjp_relaxed FROM utf8_to_eucjp_relaxed;

I think an overhaul of conversion support is a TODO item.
--
Itagaki Takahiro
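The commands above are proposed syntax, not something the server accepts. One way to get a similar effect for an initial load is to normalize the file on the client before running COPY. The Python sketch below is only an illustration under that assumption: `relax_line`, `relax_file` and the one-entry table are hypothetical names, and the table contains only the wave dash pair discussed in this thread.

```python
# Hypothetical pre-load fix-up: rewrite U+301C (WAVE DASH) as U+FF5E
# (FULLWIDTH TILDE), the code point this thread says is mapped to
# EUC-JP 0xa1c1. Check the server's real conversion map before adding
# further entries.
RELAXED = str.maketrans({"\u301c": "\uff5e"})


def relax_line(line: str) -> str:
    """Return the line with every U+301C replaced by U+FF5E."""
    return line.translate(RELAXED)


def relax_file(src_path: str, dst_path: str) -> None:
    """Copy a UTF-8 text file, applying relax_line to each line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(relax_line(line))
```

The rewritten file can then be loaded with an ordinary COPY. The substitution is lossy: after loading, nothing records which wave dashes were originally U+301C.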
Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 10:58, Tatsuo Ishii <ishii@postgresql.org> wrote:
> So if we want to do a round trip conversion between
> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
> chosen U+FF5E. If we change the mapping, many existing applications
> would break.

I have heard requests a few times for an additional one-directional conversion from U+301C to EUC-JP (0xa1c1). It should not break existing applications. We already have non-round-trip conversions for IBM and NEC extended characters in SJIS. The policy does not seem so strict to me.

Anyway, we might need to revisit this area in the near term for the Unicode emoji issue.
--
Itagaki Takahiro
>> So if we want to do a round trip conversion between
>> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
>> chosen U+FF5E. If we change the mapping, many existing applications
>> would break.
>
> I heard a request a few times for an additional one-directional conversion
> from U+301C to EUC-JP (0xa1c1). It should not break existing applications.
> We already have non-round trip conversions for IBM and NEC extended
> characters in SJIS. The policy seems not so strict for me.

Doesn't breaking round-trip conversion between EUC-JP and UTF-8 itself break backward compatibility?

I think the best we can do here is to add a new encoding and default conversion.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
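The disagreement above can be illustrated with a toy model in Python. This is not PostgreSQL's conversion code: the dictionaries hold only the wave dash entries mentioned in the thread, and `round_trip` is a hypothetical helper written for this sketch.

```python
# Toy conversion tables containing only the entries discussed in the thread.
TO_EUCJP_STRICT = {"\uff5e": b"\xa1\xc1"}
# Proposed addition: a one-directional entry for U+301C.
TO_EUCJP_RELAXED = {**TO_EUCJP_STRICT, "\u301c": b"\xa1\xc1"}
FROM_EUCJP = {b"\xa1\xc1": "\uff5e"}


def round_trip(ch, to_eucjp):
    """Return ch after UTF-8 -> EUC-JP -> UTF-8, or None if it cannot be mapped."""
    encoded = to_eucjp.get(ch)
    return None if encoded is None else FROM_EUCJP[encoded]


print(round_trip("\uff5e", TO_EUCJP_STRICT) == "\uff5e")   # True: preserved
print(round_trip("\u301c", TO_EUCJP_STRICT))               # None: the reported error
print(round_trip("\uff5e", TO_EUCJP_RELAXED) == "\uff5e")  # True: existing data unaffected
print(round_trip("\u301c", TO_EUCJP_RELAXED) == "\u301c")  # False: comes back as U+FF5E
```

In this model both positions hold: text that converts today is unchanged by the extra entry, but U+301C no longer survives a round trip, which is the compatibility concern raised in the reply.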
Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 13:02, Tatsuo Ishii <ishii@postgresql.org> wrote:
> I think what we can do best here is, adding new encoding and default
> conversion.

Agreed, if the encoding is added as a user-defined encoding. I don't want to add any more built-in encodings only for the Japanese language.
--
Itagaki Takahiro
> Agreed if the encoding is added as an user-defined encoding.
> I don't want to add built-in encodings only for Japanese language any more.

I do not agree here. Adding one more encoding/conversion is not a big deal.

Anyway, these solutions would become real after one or two releases at the earliest. The realistic solution available today is replacing the default conversion for EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
From
Kasia Tuszynska
Date:
Hi,

We have a customer in Japan who would be interested in this fix in the future. Would you like me to enter it as an official Postgres bug?

Sincerely,
Kasia

-----Original Message-----
From: Tatsuo Ishii [mailto:ishii@postgresql.org]
Sent: Tuesday, March 22, 2011 10:17 PM
To: itagaki.takahiro@gmail.come
Cc: Kasia Tuszynska; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

> Agreed if the encoding is added as an user-defined encoding.
> I don't want to add built-in encodings only for Japanese language any more.

I do not agree here. Adding one more encoding/conversion is not big deal.

Anyway these solutions would come to be real after one or two releases at the earliest. The realistic solution available today is replacing default conversion for EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP
From
Itagaki Takahiro
Date:
On Fri, Mar 25, 2011 at 03:33, Kasia Tuszynska <ktuszynska@esri.com> wrote:
> We have a customer in Japan who would be interested in this fix, in the future. Would you like me to enter it as an official Postgres bug?

Not a bug at all -- there are at least 3 versions of "EUCJP" encodings, and postgres supports just one of them. I think that won't change in the near term. So, for now, you would need to define a CONVERSION for your purpose.

However, I think we could have an extension outside the core that provides a set of conversion procedures for the easily confused Japanese encodings.
--
Itagaki Takahiro
> We have a customer in Japan who would be interested in this fix, in the future. Would you like me to enter it as an official Postgres bug?
> Sincerely,

As I stated before, I don't regard this as a bug.

BTW, I wonder why you don't use CREATE CONVERSION, which can solve the customer's problem today...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp