Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Msg-id: CAG6W84J+BJ0hEe1yrPL4bxVz-MaqCFdHkWRWVBiq8BaCoY8j3Q@mail.gmail.com
In response to: Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
>> GB18030 is a special case, because it's a full mapping of all unicode
>> characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion.  There are still 30000+ characters
> that need to be translated via lookup table, so we still need either
> UtfToLocal or a clone of it; and as I said previously, I'm not on board
> with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use
>> a specialized algorithm:
>> - For characters > U+FFFF use the algorithm from my patch
>> - For characters <= U+FFFF use special mapping tables to map from/to
>> UTF32. Those tables would be smaller, and the code would be faster (I
>> assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be
> processed via linear conversions, so I think we ought to save table
> space by doing that.  However, the remaining stuff that has to be
> processed by lookup still contains a pretty substantial number of
> characters that map to 4-byte GB18030 characters, so I don't think
> we can get any table size savings by adopting a bespoke table format.
> We might as well use UtfToLocal.  (Worth noting in this connection
> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
> table entries for other encodings, even though most of the others
> are not concerned with characters outside the BMP.)
>

It's not about 4 vs. 2 bytes per entry; it's about 8 bytes vs. 4.
UtfToLocal uses a sparse array of (code point, value) pairs:

map = {{0, x}, {1, y}, {2, z}, ...}

vs. a plain array indexed by code point:

map = {x, y, z, ...}

That's fine when not every code point is used, but GB18030 is different:
almost all code points are used. A plain array halves the space per entry
and avoids the binary search.
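
To make the two layouts concrete, here is a minimal sketch in C. It is not the actual UtfToLocal code: the type and function names (sparse_entry, sparse_lookup, dense_lookup), the demo tables, and the use of 0 as a "not found" result are all assumptions made for illustration, and the table values are placeholders rather than real GB18030 codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Sparse layout (hypothetical): each entry stores the key next to the
 * value, so it costs 8 bytes per mapped code point. */
typedef struct
{
    uint32_t utf;   /* source code point */
    uint32_t code;  /* value in the target encoding */
} sparse_entry;

/* Binary search over entries sorted by utf.
 * Returns 0 when the code point is not in the table (sketch convention). */
uint32_t
sparse_lookup(const sparse_entry *map, size_t n, uint32_t utf)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (map[mid].utf == utf)
            return map[mid].code;
        if (map[mid].utf < utf)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Dense layout (hypothetical): the key is implied by the array position,
 * so it costs 4 bytes per entry and needs only one bounds check.
 * "first" is the code point stored at index 0. */
uint32_t
dense_lookup(const uint32_t *map, uint32_t first, size_t n, uint32_t utf)
{
    if (utf < first || utf - first >= n)
        return 0;
    return map[utf - first];
}

/* Placeholder tables mirroring {{0, x}, {1, y}, {2, z}} and {x, y, z}. */
const sparse_entry demo_sparse[] = {{0, 0xA1}, {1, 0xA2}, {2, 0xA3}};
const uint32_t demo_dense[] = {0xA1, 0xA2, 0xA3};
```

Both lookups return the same values for the same input; the dense form only pays off when the covered range has few holes, since every unused code point in the range still occupies a slot.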

Gr. Arjen


