Re: Illegal SJIS mapping

From Kyotaro HORIGUCHI
Subject Re: Illegal SJIS mapping
Msg-id 20161018.131042.13229590.horiguchi.kyotaro@lab.ntt.co.jp
In reply to Re: Illegal SJIS mapping  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hello,

At Fri, 7 Oct 2016 23:58:45 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<9c544547-7214-aebe-9b04-57624aedde96@iki.fi>
> > So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> > maintained. If no authoritative information is available, the
> > generating script is no longer usable. If any other authority is
> > chosen, it is to be modified according to whatever the new
> > source format is.
> 
> The script is clearly intended to read CP932.TXT, rather than
> SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at
> 
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> 
> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise same, but we have
> some extra mappings:
> 
> -  {0xc2a5, 0x5c},
> -  {0xc2ac, 0x81ca},
> -  {0xe28096, 0x8161},
> -  {0xe280be, 0x7e},
> -  {0xe28892, 0x817c},
> -  {0xe3809c, 0x8160},
> 
> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well as a few valid mappings that UCS_to_SJIS.pl also
> produces.
> 
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they

The mappings exist for historical reasons: they come from
differences between the Unicode definition and the Oracle and
Microsoft implementations, and from the evolution of the Unicode
specification. As a result, several SJIS (and EUC-JP) characters
have two or more mappings to Unicode. There are also several
variations of the reverse mapping. None of them is authoritative,
and which one to adopt depends on system requirements. The only
requirement that PostgreSQL should keep seems to be round-trip
consistency starting from SJIS input.
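
To make the round-trip requirement concrete, here is a minimal
sketch in Python (not the Perl used by the real script). The SJIS to
UTF-8 excerpt is my assumption of what CP932.TXT yields for these six
codes, the extra entries are the ones quoted above, and the helper
names build_reverse and round_trip_failures are hypothetical.

```python
# SJIS -> UTF-8 excerpt. Values are assumed to match CP932.TXT;
# UTF-8 sequences are written as integers, as in utf8_to_sjis.map.
SJIS_TO_UTF8 = {
    0x5C: 0x5C,        # U+005C REVERSE SOLIDUS
    0x7E: 0x7E,        # U+007E TILDE
    0x8160: 0xEFBD9E,  # U+FF5E FULLWIDTH TILDE
    0x8161: 0xE288A5,  # U+2225 PARALLEL TO
    0x817C: 0xEFBC8D,  # U+FF0D FULLWIDTH HYPHEN-MINUS
    0x81CA: 0xEFBFA2,  # U+FFE2 FULLWIDTH NOT SIGN
}

# Extra UTF-8 -> SJIS entries that are not derivable from CP932.TXT.
EXTRA_UTF8_TO_SJIS = {
    0xC2A5: 0x5C,      # U+00A5 YEN SIGN
    0xC2AC: 0x81CA,    # U+00AC NOT SIGN
    0xE28096: 0x8161,  # U+2016 DOUBLE VERTICAL LINE
    0xE280BE: 0x7E,    # U+203E OVERLINE
    0xE28892: 0x817C,  # U+2212 MINUS SIGN
    0xE3809C: 0x8160,  # U+301C WAVE DASH
}


def build_reverse(forward, extra):
    """Invert the forward table, then add extras without overriding it."""
    reverse = {utf8: sjis for sjis, utf8 in forward.items()}
    for utf8, sjis in extra.items():
        reverse.setdefault(utf8, sjis)
    return reverse


def round_trip_failures(forward, reverse):
    """Return SJIS codes that do not survive SJIS -> UTF-8 -> SJIS."""
    return [sjis for sjis, utf8 in forward.items()
            if reverse.get(utf8) != sjis]


UTF8_TO_SJIS = build_reverse(SJIS_TO_UTF8, EXTRA_UTF8_TO_SJIS)
failures = round_trip_failures(SJIS_TO_UTF8, UTF8_TO_SJIS)
```

The extra entries make the UTF-8 to SJIS direction many-to-one, but
they cannot break the round trip as long as they never replace an
entry derived from the SJIS to UTF-8 table.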

> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.

Agreed. I will do that, at least for the Japanese charsets.
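
As a rough illustration of the idea (in Python, not the actual
UCS_to_SJIS.pl), the generator could merge a hard-coded list into
whatever it reads from the source file. The line format assumed here
is "0xSJIS<TAB>0xUCS<TAB>#comment", the function names are
hypothetical, and the output is sorted on the assumption that the map
is searched by key.

```python
# Hard-coded (Unicode code point, SJIS code) pairs absent from CP932.TXT.
EXTRA = [
    (0x00A5, 0x5C),
    (0x00AC, 0x81CA),
    (0x2016, 0x8161),
    (0x203E, 0x7E),
    (0x2212, 0x817C),
    (0x301C, 0x8160),
]


def utf8_as_int(code_point):
    """Encode a code point as UTF-8 and pack the bytes into one integer."""
    return int.from_bytes(chr(code_point).encode("utf-8"), "big")


def parse_mapping(lines):
    """Read 'SJIS UCS #comment' lines into a {utf8_int: sjis} dict."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        # Skip comment-only lines and codes with no Unicode value.
        if len(fields) < 2 or not fields[1].lower().startswith("0x"):
            continue
        sjis, ucs = int(fields[0], 16), int(fields[1], 16)
        table[utf8_as_int(ucs)] = sjis
    return table


def utf8_to_sjis_entries(lines):
    """Merge the hard-coded extras and format entries like the .map file."""
    table = parse_mapping(lines)
    for ucs, sjis in EXTRA:
        # Entries from the source file win over the hard-coded ones.
        table.setdefault(utf8_as_int(ucs), sjis)
    return ["  {0x%x, 0x%x}," % (utf8, sjis)
            for utf8, sjis in sorted(table.items())]


sample = [
    "0x5C\t0x005C\t#REVERSE SOLIDUS",
    "0x8160\t0xFF5E\t#FULLWIDTH TILDE",
    "0x80\t      \t#UNDEFINED",
]
entries = utf8_to_sjis_entries(sample)
```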

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




