Re: [HACKERS] Extra Vietnamese unaccent rules

From Dang Minh Huong
Subject Re: [HACKERS] Extra Vietnamese unaccent rules
Date
Msg-id D367CC2F-5595-4370-827A-C439C0361979@gmail.com
In response to Re: [HACKERS] Extra Vietnamese unaccent rules  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] Extra Vietnamese unaccent rules
List pgsql-hackers
Hi,

I am interested in this thread.

On May 27, 2017, at 10:41, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´".  The field "00E2 0301" is the decomposed form of
that character above.  Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.
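
To make the one-step decomposition concrete, here is a minimal sketch using Python's standard unicodedata module, which exposes the same decomposition field as UnicodeData.txt. The helper name decomposition_field is only an illustration and is not part of the unaccent script.

```python
import unicodedata

def decomposition_field(ch):
    """Return the one-step decomposition of ch as a list of code points.

    Compatibility tags such as "<compat>" are skipped; a character with
    no decomposition yields an empty list.
    """
    field = unicodedata.decomposition(ch)
    return [int(cp, 16) for cp in field.split() if not cp.startswith("<")]

# U+00E9 (e with acute) decomposes directly to a plain letter plus one mark.
print([f"{cp:04X}" for cp in decomposition_field("\u00e9")])
# U+1EA5 (a with circumflex and acute) decomposes to U+00E2 plus one mark,
# and U+00E2 in turn decomposes to "a" plus a circumflex.
print([f"{cp:04X}" for cp in decomposition_field("\u1ea5")])
print([f"{cp:04X}" for cp in decomposition_field("\u00e2")])
```

The second call shows the case described above: the first element of the field is itself a composed character, not a plain letter.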

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).
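
For comparison, a full canonical decomposition (NFD) applies the decomposition fields recursively, so the same character comes out as one base letter followed by two combining marks. A small sketch, again with the standard unicodedata module; nfd_codepoints is a hypothetical helper name.

```python
import unicodedata

def nfd_codepoints(text):
    """Return the code points of text after canonical decomposition (NFD)."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

# U+1EA5 fully decomposed: base letter, circumflex, acute.
print(nfd_codepoints("\u1ea5"))

# Recomposing with NFC gives back the single composed code point.
print(unicodedata.normalize("NFC", "a\u0302\u0301") == "\u1ea5")
```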

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".
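
The recursion suggested above could look roughly like the following. This is a self-contained sketch and not the attached patch: the Codepoint record, the TABLE dictionary, and the helpers is_plain_letter and is_mark are simplified assumptions made for illustration. Only the names is_letter_with_marks, get_plain_letter, and combining_ids come from the discussion.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for the script's per-codepoint record.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])

# Tiny hand-written table covering only the characters discussed here.
TABLE = {
    0x0061: Codepoint(0x0061, "Ll", []),                # a
    0x0301: Codepoint(0x0301, "Mn", []),                # combining acute
    0x0302: Codepoint(0x0302, "Mn", []),                # combining circumflex
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a with circumflex
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # a with circumflex, acute
}

def is_plain_letter(cp):
    """True for unaccented ASCII letters (assumed definition)."""
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")

def is_mark(cp):
    """True for combining marks (general categories Mn, Me, Mc)."""
    return cp.general_category in ("Mn", "Me", "Mc")

def is_letter_with_marks(cp, table):
    """True if cp is a plain letter plus marks, possibly through several levels."""
    if len(cp.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    # The base may be a plain letter, or itself a letter with marks.
    return is_plain_letter(base) or is_letter_with_marks(base, table)

def get_plain_letter(cp, table):
    """Follow combining_ids[0] until a plain letter is reached.

    Assumes is_letter_with_marks(cp, table) is true.
    """
    base = table[cp.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)

print(chr(get_plain_letter(TABLE[0x1EA5], TABLE).id))
```

With the non-recursive check, U+1EA5 would be rejected because its first component, U+00E2, is not a plain letter.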


Thanks for the report and for the explanation of Unicode.
I have attached a patch that follows Thomas's instructions. Could you please review it?

Actually, given the recent work on unicode_norm_table.h, which
transposes UnicodeData.txt into user-friendly tables, shouldn't the
Python script in unaccent/ be replaced by something that works on this
table? This does a canonical decomposition but keeps only the first
characters with a combining class of 0. So we have basic APIs able to
look at UnicodeData.txt and let the caller make decisions based on the
result returned.
--
Michael
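
The idea of decomposing canonically and keeping only the characters with combining class 0 can be sketched as follows. This uses Python's unicodedata module as a stand-in and does not touch unicode_norm_table.h; strip_marks is a hypothetical name chosen for the example.

```python
import unicodedata

def strip_marks(text):
    """Canonically decompose text and drop every combining character.

    Characters whose canonical combining class is 0 (base letters,
    digits, spaces, ...) are kept; everything else is discarded.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(strip_marks("\u1ea5"))          # a with circumflex and acute
print(strip_marks("Vi\u1ec7t Nam"))   # "Việt Nam"
```

Because NFD already expands multi-level decompositions, characters with two accents need no special handling in this approach.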

Thanks, I will look into it.

---
Dang Minh Huong