Re: Notes about fixing regexes and UTF-8 (yet again)

From: NISHIYAMA Tomoaki
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date:
Msg-id: E4F0A52A-AA30-40CB-86A4-D795AB33DC64@staff.kanazawa-u.ac.jp
In response to: Re: Notes about fixing regexes and UTF-8 (yet again)  (Andrew Dunstan <andrew@dunslane.net>)
Responses: Re: Notes about fixing regexes and UTF-8 (yet again)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

I don't believe it is valid to ignore CJK characters above U+20000.
If such characters are used in names, they will be stored in the database.
If they behave differently from characters below U+FFFF, you will
get a bug report sooner or later.

See CJK Extension B, C, and D at
http://www.unicode.org/charts/

Also, there are some code points that could be regarded as letters and digits:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
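
To make the point concrete, here is a small Python sketch (not PostgreSQL code) that asks Python's bundled Unicode database, through the standard unicodedata module, for the general category of a few code points above U+FFFF. The sample names and the helper general_category are only illustrative.

```python
import unicodedata

# Code points outside the BMP (above U+FFFF) that carry letter or digit
# properties in the Unicode Character Database.
samples = {
    "cjk_ext_b": "\U00020000",       # first code point of CJK Extension B
    "math_bold_a": "\U0001D400",     # MATHEMATICAL BOLD CAPITAL A
    "math_bold_zero": "\U0001D7CE",  # MATHEMATICAL BOLD DIGIT ZERO
}


def general_category(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


categories = {name: general_category(ch) for name, ch in samples.items()}

for name, ch in samples.items():
    print(f"U+{ord(ch):05X} {name}: {categories[name]}")
```

The categories reported are Lo (other letter), Lu (uppercase letter) and Nd (decimal digit), so an engine that assigns no class at all to these code points would treat them differently from their BMP counterparts.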

On the other hand, it is OK if processing of characters above U+10000 is very slow,
as long as they are processed properly, because such characters are considered rare.


On 2012/02/17, at 23:56, Andrew Dunstan wrote:

>
>
> On 02/17/2012 09:39 AM, Tom Lane wrote:
>> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>>> Here's a wild idea: keep the class of each codepoint in a hash table.
>>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>>> string contains a character that's not in the hash table yet, query the
>>> class of that character, and add it to the hash table. Then recompile
>>> the whole regex and restart the matching engine.
>>> Recompiling is expensive, but if you cache the results for the session,
>>> it would probably be acceptable.
>> Dunno ... recompiling is so expensive that I can't see this being a win;
>> not to mention that it would require fundamental surgery on the regex
>> code.
>>
>> In the Tcl implementation, no codepoints above U+FFFF have any locale
>> properties (alpha/digit/punct/etc), period.  Personally I'd not have a
>> problem imposing the same limitation, so that dealing with stuff above
>> that range isn't really a consideration anyway.
>
>
> up to U+FFFF is the BMP which is described as containing "characters for almost all modern languages, and a large
> number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in
> the supplementary planes.
>
> See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.
>
>
> cheers
>
> andrew


