Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

From: Ashutosh Sharma
Subject: Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8
Msg-id: CAE9k0P=u9DTeRNQajcBVbUF21d-6ufZf9Z5P08Gg6QiUCx=U7Q@mail.gmail.com
In response to: Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > Hi All,
> >
> > Today while working on some other task related to database encoding, I
> > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> > UTF-8. See below:
> >
> > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> >  convert
> > ----------
> >  \xefbc8d
> > (1 row)
> >
> > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> > HYPHEN-MINUS?
>
> No it's not a bug, but a well-known "design":(
>
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.
>
> The CP932.TXT used is this one:
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.
>
> 0x817C  0xFF0D  #FULLWIDTH HYPHEN-MINUS
>
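
For anyone who wants to check the 0x817C(SJIS) = 0xA1DD(EUC-JP) correspondence by hand, here is a minimal Python sketch of the usual Shift_JIS to JIS X 0208 arithmetic. The function name is hypothetical and this is not what UCS_to_EUC_JP.pl runs; it only covers the JIS X 0208 lead-byte ranges (0x81-0x9F, 0xE0-0xEF), not the CP932 vendor extension rows.

```python
from typing import Tuple


def sjis_to_euc_jp(s1: int, s2: int) -> Tuple[int, int]:
    """Convert one double-byte Shift_JIS code to its EUC-JP byte pair.

    Hypothetical helper for illustration; handles JIS X 0208 rows only.
    """
    if not (0x81 <= s1 <= 0x9F or 0xE0 <= s1 <= 0xEF):
        raise ValueError(f"lead byte 0x{s1:02x} is outside the JIS X 0208 range")
    if not (0x40 <= s2 <= 0xFC) or s2 == 0x7F:
        raise ValueError(f"trail byte 0x{s2:02x} is not valid in Shift_JIS")

    # Each Shift_JIS lead byte covers two JIS rows.
    lead_base = 0x81 if s1 <= 0x9F else 0xC1
    j1 = (s1 - lead_base) * 2 + 0x21

    if s2 >= 0x9F:
        # Upper half of the trail-byte range selects the second row.
        j1 += 1
        j2 = s2 - 0x7E
    else:
        # Lower half; 0x7F is skipped in Shift_JIS, hence the two offsets.
        j2 = s2 - (0x1F if s2 <= 0x7E else 0x20)

    # EUC-JP is the JIS X 0208 code with the high bit set on both bytes.
    return j1 | 0x80, j2 | 0x80


e1, e2 = sjis_to_euc_jp(0x81, 0x7C)
print(f"0x{e1:02x}{e2:02x}")  # prints 0xa1dd
```

So the EUC-JP table inherits whatever Unicode code point CP932.TXT assigns to 0x817C.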

MINUS SIGN (U+2212) is defined in both UTF-8 and EUC-JP. So I'm not
sure why converting MINUS SIGN from UTF-8 to EUC-JP should throw an
error saying: "... in encoding UTF8 has *no* equivalent in EUC_JP".
The message looks misleading, and that's the reason I feel it's a bug.
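
To keep the two characters apart, the following small Python sketch prints the Unicode name and UTF-8 byte sequence of each code point discussed here. It uses only the standard library; the helper name is hypothetical and nothing in it touches PostgreSQL's conversion tables.

```python
import unicodedata
from typing import Tuple


def describe(code_point: int) -> Tuple[str, str]:
    """Return (Unicode name, UTF-8 bytes as dash-separated hex)."""
    ch = chr(code_point)
    utf8_hex = "-".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return unicodedata.name(ch), utf8_hex


for cp in (0x2212, 0xFF0D):
    name, utf8_hex = describe(cp)
    print(f"U+{cp:04X}  {name}  {utf8_hex}")
# U+2212  MINUS SIGN  e2-88-92
# U+FF0D  FULLWIDTH HYPHEN-MINUS  ef-bc-8d
```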

> > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> > converted to EUC-JP, the convert function fails with an error saying:
> > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> > equivalent in encoding EUC_JP". See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> > ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> > has no equivalent in encoding "EUC_JP"
>
> U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
> U+2212(e2 88 92) doesn't have a mapping to euc-jp.
>
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
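
The ping-pong described above can be modelled with two tiny lookup tables. This is only a sketch: the dictionaries are hand-written and hold just the entries quoted in this thread, they are not PostgreSQL's generated map files, and the function name is hypothetical.

```python
from typing import List

# SJIS -> Unicode, the entry quoted from CP932.TXT.
SJIS_TO_UCS = {0x817C: 0xFF0D}

# Unicode -> SJIS: the reverse of the entry above, plus the extra
# one-way entry for U+2212 that the thread attributes to UCS_to_SJIS.pl.
UCS_TO_SJIS = {
    0xFF0D: 0x817C,
    0x2212: 0x817C,
}


def ping_pong(code_point: int, rounds: int = 2) -> List[str]:
    """Convert Unicode -> SJIS -> Unicode repeatedly and record each step."""
    trail = [f"U+{code_point:04X}"]
    for _ in range(rounds):
        sjis = UCS_TO_SJIS[code_point]
        trail.append(f"0x{sjis:04x}@sjis")
        code_point = SJIS_TO_UCS[sjis]
        trail.append(f"U+{code_point:04X}")
    return trail


print(" => ".join(ping_pong(0x2212)))
# U+2212 => 0x817c@sjis => U+FF0D => 0x817c@sjis => U+FF0D
```

Because U+2212 appears only on the Unicode-to-SJIS side, the first round trip is lossy and every later one is stable at U+FF0D.
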
> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe we can add
> some one-way conversions for convenience.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
