Re: Patch for bug #12845 (GB18030 encoding)

Поиск
Список
Период
Сортировка
От Arjen Nienhuis
Тема Re: Patch for bug #12845 (GB18030 encoding)
Дата
Msg-id CAG6W84+FnfkFj2G_OXeLEnUWAoK2QGHNmy-72E_Yza7J-LfvAQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Patch for bug #12845 (GB18030 encoding)  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Thu, May 14, 2015 at 11:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>>> Maybe not, but at the very least we should consider getting it fixed in
>>>> 9.5 rather than waiting a full development cycle.  Same as in
>>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>>>> I'm not saying we MUST include it in 9.5, but we should at least
>>>> consider it.  If we simply stash it in the open CF we guarantee that it
>>>> will linger there for a year.
>
>>> Sure, if somebody has the time to put into it now, I'm fine with that.
>>> I'm afraid it won't be me, though: even if I had the time, I don't
>>> know enough about encodings.
>
>> I concur that we should at least consider this patch for 9.5.  I've
>> added it to
>> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> I looked at this patch a bit, and read up on GB18030 (thank you
> wikipedia).  I concur we have a problem to fix.  I do not like the way
> this patch went about it though, ie copying-and-pasting LocalToUtf and
> UtfToLocal and their supporting routines into utf8_and_gb18030.c.
> Aside from being duplicative, this means the improved mapping capability
> isn't available to use with anything except GB18030.  (I do not know
> whether there are any linear mapping ranges in other encodings, but
> seeing that the Unicode crowd went to the trouble of defining a notation
> for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)
>
> What I think would be a better solution, if slightly more invasive,
> is to extend LocalToUtf and UtfToLocal to add a callback function
> argument for a function of signature "uint32 translate(uint32)".
> This function, if provided, would be called after failing to find a
> mapping in the mapping table(s), and it could implement any translation
> that would be better handled by code than as a boatload of mapping-table
> entries.  If it returns zero then it doesn't know a translation either,
> so throw error as before.
>
> An alternative definition that could be proposed would be to call the
> function before consulting the mapping tables, not after, on the grounds
> that the function can probably exit cheaply if the input's not in a range
> that it cares about.  However, consulting the mapping table first wins
> if you have ranges that mostly work but contain a few exceptions: put
> the exceptions in the mapping table and then the function need not worry
> about handling them.
>
> Another alternative approach would be to try to define linear mapping
> ranges in a tabular fashion, for more consistency with what's there now.
> But that probably wouldn't work terribly well because the bytewise
> character representations used in this logic have to be converted into
> code points before you can do any sort of linear mapping.  We could
> hard-wire that conversion for UTF8, but the conversion in the other code
> space would be encoding-specific.  So we might as well just treat the
> whole linear mapping behavior as a black box function for each encoding.
>
> I'm also discounting the possibility that someone would want an
> algorithmic mapping for cases involving "combined" codes (ie pairs of
> UTF8 characters).  Of the encodings we support, only EUC_JIS_2004 and
> SHIFT_JIS_2004 need such cases at all, and those have only a handful of
> cases; so it doesn't seem popular enough to justify the extra complexity.
>
> I also notice that pg_gb18030_verifier isn't even close to strict enough;
> it basically relies on pg_gb18030_mblen which contains no checks
> whatsoever on the third and fourth bytes.  So that needs to be fixed.
>
> The verification tightening would definitely not be something to
> back-patch, and I'm inclined to think that the additional mapping
> capability shouldn't be either, in view of the facts that (a) we've
> had few if any field complaints yet, and (b) changing the signatures
> of LocalToUtf/UtfToLocal might possibly break third-party code.
> So I'm seeing this as a HEAD-only patch, but I do want to try to
> squeeze it into 9.5 rather than wait another year.
>
> Barring objections, I'll go make this happen.

GB18030 is a special case, because it's a full mapping of all unicode
characters, and most of it is algorithmically defined. This makes
UtfToLocal a bad choice to implement it. UtfToLocal assumes a sparse
array with only the defined characters. It uses binary search to find
a character. The 2 tables it uses now are huge (the .so file is 1MB).
Adding the rest of the valid characters to this scheme is possible,
but would make the problem worse.

I think fixing UtfToLocal only for the new characters is not optimal.

I think the best solution is to get rid of UtfToLocal for GB18030. Use
a specialized algorithm:
- For characters > U+FFFF use the algorithm from my patch
- For charcaters <= U+FFFF use special mapping tables to map from/to
UTF32. Those tables would be smaller, and the code would be faster (I
assume).

For example (256 KB):

unsigned int utf32_to_gb18030[65536] = {
        /* 0x0 */ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
        /* 0x8 */ 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
--
        /* 0xdb08 */ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
--
        /* 0xfff0 */ 0x8431a334, 0x8431a335, 0x8431a336, 0x8431a337,
0x8431a338, 0x8431a339, 0x8431a430, 0x8431a431,
        /* 0xfff8 */ 0x8431a432, 0x8431a433, 0x8431a434, 0x8431a435,
0x8431a436, 0x8431a437, 0x8431a438, 0x8431a439
};

Instead of (500KB):

static pg_utf_to_local ULmapGB18030[ 63360 ] = {
  {0xc280, 0x81308130},
  {0xc281, 0x81308131},
--
  {0xefbfbe, 0x8431a438},
  {0xefbfbf, 0x8431a439}
};


See the attachment for a python script to generate those mappings.

Gr. Arjen

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: Re: KNN-GiST with recheck
Следующее
От: Fujii Masao
Дата:
Сообщение: Re: Proposal : REINDEX xxx VERBOSE