Thread: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Kasia Tuszynska
Date:
Hi,
I was wondering if this was considered a bug, and if so what were the plans
to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php

I searched the: pgsql-bug archive and found nothing
I also searched the wiki to do list and found nothing
But I could have missed it.

Sincerely,
Kasia

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Tatsuo Ishii
Date:
> Hi,
> I was wondering if this was considered a bug, and if so what were the plans to fix it:
> http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php
>
> I searched the: pgsql-bug archive and found nothing
> I also searched the wiki to do list and found nothing
> But I could have missed it.

I don't consider it a bug.

We map "WAVE DASH" of EUC-JP (0xa1c1) to U+FF5E, not U+301C. U+FF5E
and U+301C look the same, but they are different code points for some
reason I don't know. On the other hand EUC-JP has only one code point
for WAVE DASH. So if we want to do a round trip conversion between
EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
chosen U+FF5E. If we change the mapping, many existing applications
would break.

The same can be said of the MINUS SIGN.
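
To make the two code points concrete, here is a minimal Python sketch that prints each character's Unicode name and its UTF-8 byte sequence. The helper name describe is made up for illustration; it only uses the standard unicodedata module and does not touch any PostgreSQL conversion table.

```python
import unicodedata

# The two look-alike code points discussed above.
WAVE_DASH = "\u301c"
FULLWIDTH_TILDE = "\uff5e"


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME -> 0x<utf-8 bytes in hex>' for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} -> 0x{ch.encode('utf-8').hex()}"


print(describe(WAVE_DASH))        # U+301C WAVE DASH -> 0xe3809c
print(describe(FULLWIDTH_TILDE))  # U+FF5E FULLWIDTH TILDE -> 0xefbd9e
```

The first byte sequence, 0xe3809c, is the one named in the error message, so the rejected input character is U+301C.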
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 08:05, Kasia Tuszynska <ktuszynska@esri.com> wrote:
> I was wondering if this was considered a bug, and if so what were the plans
> to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php

The wave dash issue is not postgres-specific; some other converters just
replace it with '?'. Instead, postgres throws an error.
I guess there is no possibility to support ambiguous character mappings
in the default conversions, but you can define more relaxed conversion
procedures for your purpose.
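
The two policies mentioned above (raise an error versus substitute '?') can be illustrated with Python's built-in codec error handlers. This is only a sketch of the behaviours: ASCII stands in for the target encoding here, because EUC-JP mapping tables differ between implementations, and the function names are made up for illustration.

```python
WAVE_DASH = "\u301c"
text = "10" + WAVE_DASH + "20"


def convert_strict(s: str) -> bytes:
    """Refuse unmappable characters (the error-raising policy)."""
    return s.encode("ascii", errors="strict")


def convert_lossy(s: str) -> bytes:
    """Replace each unmappable character with '?' (the lossy policy)."""
    return s.encode("ascii", errors="replace")


try:
    convert_strict(text)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)        # True
print(convert_lossy(text))  # b'10?20'
```

The lossy variant never fails, but the original character cannot be recovered afterwards; the strict variant forces the caller to decide what to do.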


BTW, we cannot use non-default conversion procedures from SQL commands,
right?  If it were allowed, we could use some "relaxed" conversions
on the initial loading, like this:

=# SET character_conversion TO utf8_to_eucjp_relaxed;
=# COPY tbl FROM '/file_with_wave_dashes.utf8.tsv';
=# RESET character_conversion;
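
Since the SET character_conversion syntax above is only a proposal, a similar effect for a one-off load could be approximated on the client side by normalizing the file before COPY. The following is a rough sketch, not something proposed in the thread: the function names are hypothetical, the WAVE DASH entry follows the mapping described earlier in the thread, and the MINUS SIGN entry (U+2212 to U+FF0D) is an assumption that should be checked against the server's actual conversion table.

```python
# Look-alike characters mapped to the code points that the default
# UTF8 -> EUC_JP conversion is said to accept.
NORMALIZE = str.maketrans({
    "\u301c": "\uff5e",  # WAVE DASH -> FULLWIDTH TILDE (per the thread)
    "\u2212": "\uff0d",  # MINUS SIGN -> FULLWIDTH HYPHEN-MINUS (assumption)
})


def normalize_for_eucjp(text: str) -> str:
    """Replace the look-alike characters listed in NORMALIZE."""
    return text.translate(NORMALIZE)


def normalize_file(src_path: str, dst_path: str) -> None:
    """Copy a UTF-8 text file line by line, normalizing each line."""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(normalize_for_eucjp(line))
```

Unlike a server-side conversion, this changes the stored data itself, so it only suits cases where U+301C and U+FF5E need not be distinguished later.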

Another idea is to allow creating new encoding names and defining
the above conversion procs as the default:

=# CREATE ENCODING eucjp_relaxed;
=# CREATE DEFAULT CONVERSION xxx FOR utf8 TO eucjp_relaxed
     FROM utf8_to_eucjp_relaxed;

I think overhaul of conversion support is a TODO item.

--
Itagaki Takahiro

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 10:58, Tatsuo Ishii <ishii@postgresql.org> wrote:
> So if we want to do a round trip conversion between
> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
> chosen U+FF5E. If we change the mapping, many existing applications
> would break.

I heard a request a few times for an additional one-directional conversion
from U+301C to EUC-JP (0xa1c1). It should not break existing applications.
We already have non-round trip conversions for IBM and NEC extended
characters in SJIS. The policy seems not so strict for me.
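
What an additional one-directional entry would mean can be sketched with a toy one-character table in Python. The tables below contain only the mapping stated in this thread (0xa1c1 and U+FF5E) plus the proposed extra entry; they are an illustration, not PostgreSQL's real conversion maps, and all names are made up.

```python
EUC_WAVE = b"\xa1\xc1"  # the single EUC-JP code point for the wave dash

# EUC-JP -> Unicode, as described in the thread.
DECODE = {EUC_WAVE: "\uff5e"}

# Unicode -> EUC-JP: the strict table is the exact inverse of DECODE.
ENCODE_STRICT = {"\uff5e": EUC_WAVE}

# The relaxed table adds one extra, one-directional entry for U+301C.
ENCODE_RELAXED = {**ENCODE_STRICT, "\u301c": EUC_WAVE}


def round_trip(ch: str, encode_table: dict) -> str:
    """Encode a character to EUC-JP and decode it back to Unicode."""
    return DECODE[encode_table[ch]]


print("\u301c" in ENCODE_STRICT)                        # False
print(round_trip("\uff5e", ENCODE_RELAXED) == "\uff5e")  # True
print(round_trip("\u301c", ENCODE_RELAXED) == "\u301c")  # False
```

In this toy model, data that already uses U+FF5E behaves the same under both tables, while U+301C is accepted by the relaxed table but comes back as U+FF5E.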

Anyway, we might need to revisit the area in the near term for the Unicode
Emoji issue.

--
Itagaki Takahiro

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Tatsuo Ishii
Date:
>> So if we want to do a round trip conversion between
>> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
>> chosen U+FF5E. If we change the mapping, many existing applications
>> would break.
>
> I heard a request a few times for an additional one-directional conversion
> from U+301C to EUC-JP (0xa1c1). It should not break existing applications.
> We already have non-round trip conversions for IBM and NEC extended
> characters in SJIS. The policy seems not so strict for me.

Doesn't breaking round-trip conversion between EUC-JP and UTF-8 itself
break backward compatibility?

I think what we can do best here is, adding new encoding and default
conversion.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Itagaki Takahiro
Date:
On Wed, Mar 23, 2011 at 13:02, Tatsuo Ishii <ishii@postgresql.org> wrote:
> I think what we can do best here is, adding new encoding and default
> conversion.

Agreed if the encoding is added as an user-defined encoding.
I don't want to add built-in encodings only for Japanese language any more.

--
Itagaki Takahiro

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Tatsuo Ishii
Date:
> Agreed if the encoding is added as an user-defined encoding.
> I don't want to add built-in encodings only for Japanese language any more.

I do not agree here. Adding one more encoding/conversion is not big
deal.

Anyway these solutions would come to be real after one or two releases
at the earliest. The realistic solution available today is replacing
default conversion for EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Kasia Tuszynska
Date:
Hi,
We have a customer in Japan who would be interested in this fix in the
future. Would you like me to enter it as an official Postgres bug?
Sincerely,
Kasia

-----Original Message-----
From: Tatsuo Ishii [mailto:ishii@postgresql.org]
Sent: Tuesday, March 22, 2011 10:17 PM
To: itagaki.takahiro@gmail.come
Cc: Kasia Tuszynska; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

> Agreed if the encoding is added as an user-defined encoding.
> I don't want to add built-in encodings only for Japanese language any more.

I do not agree here. Adding one more encoding/conversion is not big
deal.

Anyway these solutions would come to be real after one or two releases
at the earliest. The realistic solution available today is replacing
default conversion for EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Itagaki Takahiro
Date:
On Fri, Mar 25, 2011 at 03:33, Kasia Tuszynska <ktuszynska@esri.com> wrote:
> We have a customer in Japan who would be interested in this fix, in the future. Would you like me to enter it as an
> official Postgres bug?

Not a bug at all -- there are at least 3 versions of "EUCJP" encodings, and
postgres just supports one of them. I think it won't be changed in the near
term. So, you would need to define a CONVERSION for your purpose as of now.

However, I think we could have an extension, outside of core, that provides
a set of conversion procedures for the confusing Japanese encodings.

--
Itagaki Takahiro

Re: ERROR: character 0xe3809c of encoding "UTF8" has no equivalent in EUC_JP

From
Tatsuo Ishii
Date:
> We have a customer in Japan who would be interested in this fix, in the future. Would you like me to enter it as an
> official Postgres bug?
> Sincerely,

As I stated before, I don't regard this as a bug.

BTW, I wonder why you don't use CREATE CONVERSION, which can solve the
customer's problem today...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp