Re: BUG #13440: unaccent does not remove all diacritics
From | Thomas Munro |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | CAEepm=3Th+3XRiOoXewLvL1DybCbKxjc0FE4o6XqaZZBLUSOvg@mail.gmail.com |
In response to | Re: BUG #13440: unaccent does not remove all diacritics (Léonard Benedetti <benedetti@mlpo.fr>) |
Responses | Re: BUG #13440: unaccent does not remove all diacritics |
List | pgsql-bugs |
On Sun, Jan 24, 2016 at 4:18 PM, Léonard Benedetti <benedetti@mlpo.fr> wrote:
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem to me is that it lacks a number of "extra
> cases". In fact, the script manages arbitrarily few ligatures but leaves
> many things aside. So I looked for a way to improve the generation, to
> avoid having this trouble.
>
> As you said, some characters don't have Unicode decomposition. So, to
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), it associates Unicode characters to
> ASCII-range equivalent. This approach seems much more elegant, this
> avoids hardcoded cases and transliterations are semantically correct (at
> least, as much as they can).

Wow. It would indeed be nice to use this dataset rather than
maintaining the special cases for œ et al. It would also be nice to
pick up all those other things like ©, ½, …, ≪, ≫ (though these stray a
little bit further from the functionality implied by unaccent's name).

I don't think this alone will completely get rid of the hardcoded
special cases though, because we have these two mappings which look
like Latin but are in fact Cyrillic and I assume we need to keep them:

    Ё Е
    ё е

Should we extend the composition data analysis to make these remaining
special cases go away? We'd need a definition of is_plain_letter that
returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
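A minimal sketch of that idea, not the code from the patch or from the existing generator script. It uses Python's standard unicodedata module instead of parsing UnicodeData.txt. The PLAIN_SCRIPTS whitelist, the strip_marks helper and this particular definition of is_plain_letter are assumptions made for illustration.

```python
import unicodedata

# Hypothetical whitelist: scripts are recognised by the first word of the
# Unicode character name. This is an assumption, not the script's real logic.
PLAIN_SCRIPTS = ("LATIN", "CYRILLIC")


def is_plain_letter(ch):
    """Return True for a letter with no decomposition in a whitelisted script."""
    if not unicodedata.category(ch).startswith("L"):
        return False
    if unicodedata.decomposition(ch):
        return False
    return unicodedata.name(ch, "").split(" ")[0] in PLAIN_SCRIPTS


def is_mark(ch):
    """Return True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def strip_marks(ch):
    """Return the base letter if ch canonically decomposes into a plain
    letter followed only by combining marks, otherwise None."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None
    base, marks = decomposed[0], decomposed[1:]
    if is_plain_letter(base) and all(is_mark(m) for m in marks):
        return base
    return None
```

With this definition U+0401 and U+0451 map to U+0415 and U+0435 without a hardcoded rule, Greek letters are left alone, and œ still yields nothing because it has no decomposition. The same definition also maps й (U+0439) to и (U+0438), which is the kind of additional Cyrillic mapping that would need review by someone who knows the language.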
Depending on how you do that, you could sweep in some more Cyrillic
mappings and a ton of stuff from other scripts that have precomposed
diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
someone with knowledge of relevant languages to sign off on the result
-- so it might make sense to stick to a definition that includes just
Latin and Cyrillic for now. (Otherwise it might be tempting to use
*only* the transliterator approach, but CLDR doesn't seem to have
appropriate transliterator files for other scripts. They have for
example Cyrillic -> Latin, but we'd want Cyrillic ->
some-subset-of-Cyrillic, analogous to Latin -> ASCII.)

> So, I modified the script: the arguments of the command line are used to
> pass the file path of the transliterator (available as an XML file in
> Unicode Common Locale Data Repository), so you find attached the new
> script and the generated output for convenience, I will also propose a
> patch for Commitfest. Note that the script now takes (at most) two input
> files: UnicodeData.txt and (optionally) the XML file of the transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> by several surface changes. There is now a very light support for
> command line arguments with help messages. The text file was, before,
> passed to the script on standard input; this approach is not appropriate
> when two files must be used. So as I mentioned, the arguments of the
> command line are now used to pass the paths.
>
> Finally, the use of this transliterator increase inevitably the number
> of characters handled, I do not think it's a problem (there is 1044
> characters handled), on the contrary, and after several tests on index
> generations, I have no significant performance difference. Nonetheless,
> using the transliterator remains optional and a command line option is
> available to disable it (so one can easily generate a small rules file,
> if desired).
> It seemed however logical to me to keep it on by default:
> that is, a priori, the desired behavior.

+1

-- 
Thomas Munro
http://www.enterprisedb.com