Re: collate not support Unicode Variation Selector

From Kyotaro Horiguchi
Subject Re: collate not support Unicode Variation Selector
Date
Msg-id 20220803.152532.853689765714975476.horikyota.ntt@gmail.com
In response to Re: collate not support Unicode Variation Selector  (Thomas Munro <thomas.munro@gmail.com>)
Responses RE: collate not support Unicode Variation Selector  (荒井元成 <n2029@ndensan.co.jp>)
List pgsql-hackers
At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe it would help if you run the strings through normalize() first?
> > I'm not sure if that can combine combining characters.
> 
> I think the similarity between Latin combining characters and these
> ideographic variations might end there.  I don't think there is a
> single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese text, the two "characters" are the same
glyph.  In that sense, losing the variation selectors from a text
doesn't alter its meaning and doesn't hurt correctness at all.
Ideographic variations are useful only in special cases where their
ideographic identity is crucial.
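
To illustrate the point about normalization, here is a minimal sketch.
Python and its standard unicodedata module are only my choice for the
illustration (nothing in this thread uses them), and the helper name
survives_normalization is hypothetical.  It checks whether any of the
four normalization forms changes the quoted sequence.

```python
import unicodedata

# U+3436 followed by VARIATION SELECTOR-18 (U+E0101), as in the quoted example.
ivs = "\u3436\U000E0101"

def survives_normalization(text: str) -> bool:
    """Return True if every normalization form leaves `text` unchanged."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)

# The variation sequence has no precomposed form, so it should come back as is.
print(survives_normalization(ivs))
# By contrast, "e" + COMBINING ACUTE ACCENT is composed into a single é by NFC.
print(survives_normalization("e\u0301"))
```

So normalization alone would not make the two strings compare equal.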

> This system is for controlling small differences in rendering for the
> "same" character[1].  My computer doesn't even show the OP's example
> glyphs as different (to my eyes, at least; I can see on a random
> picture I found[2] that the one with the e0101 selector is supposed to
> have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render.  So
even when *I* look at the two characters in Excel (which I believe doesn't
have that support by default), they look exactly the same.  In that
sense, my opinion is that all ideographic variations of a character
should be treated as the same character when searching in a general
context. In other words, text matching should drop variation
selectors by default.
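
A rough sketch of what "drop variation selectors before matching" could
look like, again in Python purely for illustration.  The function names
are hypothetical, and this is not how PostgreSQL or ICU implement
anything.  It assumes that only the two Unicode blocks Variation
Selectors (U+FE00..U+FE0F) and Variation Selectors Supplement
(U+E0100..U+E01EF) need to be removed.

```python
import re

# Assumption: only these two blocks are treated as variation selectors.
#   U+FE00..U+FE0F     Variation Selectors
#   U+E0100..U+E01EF   Variation Selectors Supplement (used by ideographic variation sequences)
_VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")

def strip_variation_selectors(text: str) -> str:
    """Return `text` with all variation selectors removed."""
    return _VARIATION_SELECTORS.sub("", text)

def vs_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring variation selectors."""
    return strip_variation_selectors(a) == strip_variation_selectors(b)

base = "\u3436"                  # the ideograph from the quoted example
variant = base + "\U000E0101"    # the same ideograph with VARIATION SELECTOR-18

print(base == variant)                      # plain comparison sees two different strings
print(vs_insensitive_equal(base, variant))  # comparison after dropping the selectors
```

Note that the first block also contains U+FE0F, which is used to request
emoji presentation, so a real implementation would have to decide
whether to ignore that one as well.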

ICU's Collator [1] has the notion of "collation strength", and I saw in
an article that only Collator::IDENTICAL, among the five strength levels,
distinguishes between ideographic variations of a glyph.

> [1] http://www.unicode.org/reports/tr37/tr37-14.html
> [2] https://glyphwiki.org/wiki/u3436

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
