Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF

From Heikki Linnakangas
Subject Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF
Msg-id 54FF7147.20204@iki.fi
In response to Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF  (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
Responses Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF  (Bruce Momjian <bruce@momjian.us>)
List pgsql-bugs
On 03/10/2015 11:21 PM, Arjen Nienhuis wrote:
> On 10 Mar 2015 22:33, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
>>
>> On 03/09/2015 10:51 PM, a.g.nienhuis@gmail.com wrote:
>>>
>>> arjen=> select convert_to(chr(128512), 'GB18030');
>>>
>>> Actual output:
>>>
>>> ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
>>> has no equivalent in encoding "GB18030"
>>>
>>> Expected output:
>>>
>>>    convert_to
>>> ------------
>>>    \x9439fc36
>>> (1 row)
>>
>>
>> Hmm, looks like our gb18030 <-> Unicode conversion table only contains
>> the Unicode BMP. Code points above 0xffff are not included.
>>
>> If we added all the missing mappings as one-to-one mappings, like we've
>> done for the BMP, that would bloat the table horribly. There are over 1
>> million code points that are currently not mapped. Fortunately, the missing
>> mappings are in linear ranges that would be fairly simple to handle
>> programmatically. See e.g.
>> https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
>> Someone needs to write the code (I'm not volunteering myself).
>
> I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
> where to put it in the code.

The mapping functions are in
src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c.
They currently just consult the mapping table. You'd need to modify them
to also check if the codepoint is in one of those linear ranges, and do
the mapping for those programmatically.
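
As a rough illustration of what such a range check could look like, here is a
minimal sketch for the largest linear range only, U+10000..U+10FFFF, which
GB18030 encodes as consecutive four-byte codes starting at 0x90308130. The
function names and the constant are hypothetical and are not part of
utf8_and_gb18030.c; the sketch works on code points and packed four-byte
codes, and leaves out UTF-8 decoding and the smaller linear ranges inside
the BMP.

```c
#include <stdint.h>

/*
 * Linear index of the four-byte code 0x90308130, which maps to U+10000.
 * Four-byte codes are numbered with byte 1 and byte 3 in 0x81..0xFE
 * (126 values each) and byte 2 and byte 4 in 0x30..0x39 (10 values each):
 * (0x90 - 0x81) * 10 * 126 * 10 = 189000.
 */
#define GB18030_SUPP_BASE 189000u

/*
 * Hypothetical helper: map a supplementary-plane code point to a packed
 * four-byte GB18030 code. Returns 0 if cp is outside U+10000..U+10FFFF.
 */
static uint32_t
unicode_supp_to_gb18030(uint32_t cp)
{
    uint32_t lin, b1, b2, b3, b4;

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    lin = GB18030_SUPP_BASE + (cp - 0x10000);
    b4 = 0x30 + lin % 10;
    lin /= 10;
    b3 = 0x81 + lin % 126;
    lin /= 126;
    b2 = 0x30 + lin % 10;
    lin /= 10;
    b1 = 0x81 + lin;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/*
 * Hypothetical inverse: map a packed four-byte GB18030 code back to a
 * supplementary-plane code point. Returns 0 if the bytes are not a
 * well-formed four-byte code or fall outside the supplementary range.
 */
static uint32_t
gb18030_to_unicode_supp(uint32_t code)
{
    uint32_t b1 = (code >> 24) & 0xFF;
    uint32_t b2 = (code >> 16) & 0xFF;
    uint32_t b3 = (code >> 8) & 0xFF;
    uint32_t b4 = code & 0xFF;
    uint32_t lin;

    if (b1 < 0x81 || b1 > 0xFE || b2 < 0x30 || b2 > 0x39 ||
        b3 < 0x81 || b3 > 0xFE || b4 < 0x30 || b4 > 0x39)
        return 0;

    lin = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);
    if (lin < GB18030_SUPP_BASE || lin > GB18030_SUPP_BASE + 0xFFFFF)
        return 0;

    return 0x10000 + (lin - GB18030_SUPP_BASE);
}
```

With these definitions, unicode_supp_to_gb18030(0x1F600) yields 0x9439FC36,
which matches the expected output \x9439fc36 for chr(128512) in the report.
A real patch would presumably try the table lookup first and fall back to a
range check like this only when the lookup finds nothing.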

> Else I could also extend the map file. It would double in size if it only
> needs to include valid code points.

The current mapping table contains about 63000 mappings, but there are
over a million valid code points that need to be mapped. If you just add
every one-to-one mapping to the table, it's going to blow up in size to
over 8 MB. I don't think we want that; handling the ranges with linear
mappings programmatically makes a lot more sense.
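
The size estimate can be checked with a little arithmetic. The sketch below
assumes each table entry is a pair of 32-bit values (one for the UTF-8 side,
one for the GB18030 side); the struct and function names are hypothetical
stand-ins, not the actual declarations in the PostgreSQL source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed entry layout: one 32-bit UTF-8 value plus one 32-bit GB18030 code. */
typedef struct
{
    uint32_t utf;
    uint32_t code;
} map_entry;

/* Bytes needed for a flat table holding n_entries one-to-one mappings. */
static size_t
table_bytes(size_t n_entries)
{
    return n_entries * sizeof(map_entry);
}
```

Under that assumption, the 16 supplementary planes alone contribute
16 * 65536 = 1048576 entries, so table_bytes(1048576) comes to 8388608 bytes
(8 MiB) for one direction, before counting the roughly 63000 existing
entries.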

- Heikki
