Re: BUG #13440: unaccent does not remove all diacritics

From	Thomas Munro
Subject	Re: BUG #13440: unaccent does not remove all diacritics
Date	26 January 2016, 02:45:03
Msg-id	CAEepm=3Th+3XRiOoXewLvL1DybCbKxjc0FE4o6XqaZZBLUSOvg@mail.gmail.com
In reply to	Re: BUG #13440: unaccent does not remove all diacritics (Léonard Benedetti <benedetti@mlpo.fr>)
Responses	Re: BUG #13440: unaccent does not remove all diacritics
List	pgsql-bugs

On Sun, Jan 24, 2016 at 4:18 PM, Léonard Benedetti <benedetti@mlpo.fr> wrote:
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem to me is that it lacks a number of "extra
> cases". In fact, the script manages arbitrarily few ligatures but leaves
> many things aside. So I looked for a way to improve the generation, to
> avoid having this trouble.
>
> As you said, some characters don't have Unicode decomposition. So, to
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), it associates Unicode characters to
> ASCII-range equivalent. This approach seems much more elegant, this
> avoids hardcoded cases and transliterations are semantically correct (at
> least, as much as they can).

Wow.  It would indeed be nice to use this dataset rather than
maintaining the special cases for œ et al.  It would also be nice to pick
up all those other things like ©, ½, …, ≪, ≫ (though these stray a
little bit further from the functionality implied by unaccent's name).
I don't think this alone will completely get rid of the hardcoded
special cases though, because we have these two mappings which look
like Latin but are in fact Cyrillic and I assume we need to keep them:

Ё Е
ё е

Should we extend the composition data analysis to make these remaining
special cases go away?  We'd need a definition of is_plain_letter that
returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
Depending on how you do that, you could sweep in some more Cyrillic
mappings and a ton of stuff from other scripts that have precomposed
diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
someone with knowledge of relevant languages to sign off on the result
-- so it might make sense to stick to a definition that includes just
Latin and Cyrillic for now.
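
To make that concrete, here is a rough sketch of what such a definition
might look like.  It is not the code in the existing generator script:
it uses Python's unicodedata module instead of parsing UnicodeData.txt,
and the rule "ASCII letter, or a Cyrillic letter with no decomposition"
is only an assumption about where the line could be drawn.

```python
import string
import unicodedata


def is_plain_letter(ch):
    """Assumed definition: ASCII letters plus undecomposable Cyrillic letters."""
    if ch in string.ascii_letters:
        return True
    return (unicodedata.category(ch) in ("Lu", "Ll")
            and not unicodedata.decomposition(ch)
            and unicodedata.name(ch, "").startswith("CYRILLIC"))


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Return the plain base letter if ch is 'plain letter + marks', else None."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # nothing to strip: no canonical decomposition
    base, marks = decomposed[0], decomposed[1:]
    if is_plain_letter(base) and all(is_mark(m) for m in marks):
        return base
    return None


if __name__ == "__main__":
    # U+0401, U+0451, U+00E9, U+0153 (ligature), U+03AC (Greek)
    for ch in "\u0401\u0451\u00e9\u0153\u03ac":
        print("U+%04X" % ord(ch), repr(base_letter(ch)))
```

With this definition 0401 and 0451 come out of the decomposition data
with no hardcoding, Greek is left alone until someone signs off on it,
and ligatures such as 0153 still yield nothing, so they would have to
come from the transliterator rules.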

(Otherwise it might be tempting to use *only* the transliterator
approach, but CLDR doesn't seem to have appropriate transliterator
files for other scripts.  They have for example Cyrillic -> Latin, but
we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
ASCII.)

> So, I modified the script: the arguments of the command line are used to
> pass the file path of the transliterator (available as an XML file in
> Unicode Common Locale Data Repository), so you find attached the new
> script and the generated output for convenience, I will also propose a
> patch for Commitfest. Note that the script now takes (at most) two input
> files: UnicodeData.txt and (optionally) the XML file of the transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> by several surface changes. There is now a very light support for
> command line arguments with help messages. The text file was, before,
> passed to the script on standard input; this approach is not appropriate
> when two files must be used. So as I mentioned, the arguments of the
> command line are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number
> of characters handled. I do not think that is a problem (1044
> characters are handled), on the contrary, and after several tests on index
> generation, I saw no significant performance difference. Nonetheless,
> using the transliterator remains optional and a command line option is
> available to disable it (so one can easily generate a small rules file,
> if desired). It seemed however logical to me to keep it on by default:
> that is, a priori, the desired behavior.

+1

-- 
Thomas Munro
http://www.enterprisedb.com