Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Msg-id: 19727.1431699018@sss.pgh.pa.us
In response to: Re: Patch for bug #12845 (GB18030 encoding)  (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
Responses: Re: Patch for bug #12845 (GB18030 encoding)  (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
List: pgsql-hackers
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> GB18030 is a special case, because it's a full mapping of all unicode
> characters, and most of it is algorithmically defined.

True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion.  There are still 30000+ characters
that need to be translated via lookup table, so we still need either
UtfToLocal or a clone of it; and as I said previously, I'm not on board
with cloning it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table
space by doing that.  However, the remaining stuff that has to be
processed by lookup still contains a pretty substantial number of
characters that map to 4-byte GB18030 characters, so I don't think
we can get any table size savings by adopting a bespoke table format.
We might as well use UtfToLocal.  (Worth noting in this connection
is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
table entries for other encodings, even though most of the others
are not concerned with characters outside the BMP.)
        regards, tom lane
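
For illustration only, here is a minimal sketch of the "linear conversion first, lookup table second" idea discussed in the message above. It is not the code from the patch or from PostgreSQL. The type and function names (LinearRange, unicode_to_gb18030_linear, and so on) are hypothetical, and the range list is a two-entry subset assumed from the ICU table linked above; a real implementation would list every linear range and send everything else through the table-driven path (UtfToLocal in the discussion).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One linearly mapped range: code points [first, last] map to consecutive
 * four-byte GB18030 codes beginning at gb_first.  (Hypothetical type.)
 */
typedef struct
{
    uint32_t first;
    uint32_t last;
    uint32_t gb_first;
} LinearRange;

/* Illustrative subset only; assumed from the ICU mapping file. */
static const LinearRange linear_ranges[] = {
    {0x0452, 0x200F, 0x8130D330},       /* one linear run inside the BMP */
    {0x10000, 0x10FFFF, 0x90308130},    /* all supplementary planes */
};

/*
 * Four-byte GB18030 codes use bytes 1 and 3 in 0x81..0xFE (126 values)
 * and bytes 2 and 4 in 0x30..0x39 (10 values).  Flatten a code into its
 * position in that sequence.
 */
static uint32_t
gb18030_to_linear(uint32_t gb)
{
    uint32_t b1 = (gb >> 24) & 0xFF;
    uint32_t b2 = (gb >> 16) & 0xFF;
    uint32_t b3 = (gb >> 8) & 0xFF;
    uint32_t b4 = gb & 0xFF;

    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);
}

/* Inverse of gb18030_to_linear. */
static uint32_t
linear_to_gb18030(uint32_t linear)
{
    uint32_t b4 = 0x30 + linear % 10;
    linear /= 10;
    uint32_t b3 = 0x81 + linear % 126;
    linear /= 126;
    uint32_t b2 = 0x30 + linear % 10;
    linear /= 10;
    uint32_t b1 = 0x81 + linear;

    assert(b1 <= 0xFE);         /* position must fit the four-byte space */
    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/*
 * Return the four-byte GB18030 code for cp if it falls in a linear range,
 * or 0 to signal that the caller must use the lookup table instead.
 */
static uint32_t
unicode_to_gb18030_linear(uint32_t cp)
{
    size_t n = sizeof(linear_ranges) / sizeof(linear_ranges[0]);

    for (size_t i = 0; i < n; i++)
    {
        const LinearRange *r = &linear_ranges[i];

        if (cp >= r->first && cp <= r->last)
            return linear_to_gb18030(gb18030_to_linear(r->gb_first)
                                     + (cp - r->first));
    }
    return 0;
}
```

With these two ranges, unicode_to_gb18030_linear(0x10000) yields 0x90308130 and unicode_to_gb18030_linear(0x10FFFF) yields 0xE3329A35. A code point such as U+4E00 returns 0, meaning it still has to go through the lookup table. This is why the table cannot be dropped entirely even though a large share of the mapping is algorithmic.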


