Re: BUG #13440: unaccent does not remove all diacritics

From	Léonard Benedetti
Subject	Re: BUG #13440: unaccent does not remove all diacritics
Date	24 January 2016, 03:47:53
Msg-id	56A4495C.8020705@mlpo.fr
In reply to	Re: BUG #13440: unaccent does not remove all diacritics (Léonard Benedetti <benedetti@mlpo.fr>)
List	pgsql-bugs


On 24/01/2016 04:18, Léonard Benedetti wrote:
> On 19/06/2015 04:00, Thomas Munro wrote:
>> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I took a quick look at this list and it seems fairly sane as far as
>>> the automatically-generated items go, except that I see it hits a few
>>> LIGATURE cases (including the existing ij cases, but also fi fl and
>>> ffl). I'm still quite dubious that that is appropriate; at least, if
>>> we do it I think we should be expanding out to the equivalent
>>> multi-letter form, not simply taking one of the letters and dropping
>>> the rest. Anybody else have an opinion on how to handle ligatures?
>> Here is a version that optionally expands ligatures if asked to with
>> --expand-ligatures.  It uses the Unicode 'general category' data to
>> identify and strip diacritical marks and distinguish them from
>> ligatures which are expanded to all their parts.  It meant I had to
>> load a bunch of stuff into memory up front, but this approach can
>> handle an awkward bunch of ligatures whose component characters have
>> marks: Ǆ, ǅ, ǆ -> DZ, Dz, dz.  (These are considered to be single
>> characters to maintain a one-to-one mapping with certain Cyrillic
>> characters in some Balkan countries which use or used both scripts.)
>>
>> As for whether we *should* expand ligatures, I'm pretty sure that's
>> what I'd always want, but my only direct experience of languages with
>> ligatures as part of the orthography (rather than ligatures as a
>> typesetting artefact like ﬄ et al) is French, where œ is used in the
>> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
>> they appear in books, but substituting oe is acceptable on computers
>> because neither the standard French keyboard nor the historically
>> important Latin1 character set includes the character.  I'm fairly
>> sure the Dutch have a similar situation with Ĳ, it's completely
>> interchangeable with the sequence IJ.
>>
>> So +1 from me for ligature expansion.  It might be tempting to think
>> that a function called 'unaccent' should only remove diacritical
>> marks, but if we are going to be pedantic about it, not all
>> diacritical marks are actually accents anyway...
>>
>>> The manually added special cases don't look any saner than they did
>>> before :-(.  Anybody have an objection to removing those (except maybe
>>> dotless i) in HEAD?
>> +1 from me for getting rid of the bogus œ->e, Ĳ -> I, ... transformations, but:
>>
>> 1.  For some reason œ, æ (and uppercase equivalents) don't have
>> combining character data in the Unicode file, so they still need to be
>> treated as special cases if we're going to include ligatures.  Their
>> expansion should of course be oe and ae rather than what we have.
>> 2.  Likewise ß still needs special treatment (it may be historically
>> composed of sz but Unicode doesn't know that, it's its own character
>> now and expands to ss anyway).
>> 3.  I don't see any reason to drop the Afrikaans ŉ, though it should
>> surely be expanded to 'n rather than n.
>> 4.  I have no clue about whether the single Cyrillic item in there
>> belongs there.
>>
>> Just by the way, there are conventional rules for diacritic removal in
>> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
>> Scandinavian languages and è -> e' in Italian.  A German friend of
>> mine has a ü in his last name and he finishes up with any of three
>> possible spellings of his name on various official documents, credit
>> cards etc as a result!  But these sorts of things are specific to
>> individual languages and don't belong in a general accent removal rule
>> file (it would be inappropriate to convert French aigüe to aiguee or
>> Spanish pingüino to pingueino).  I guess speakers of those languages
>> could consider submitting rules files for language-specific
>> conventions like that.
>>
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem, to me, is that a number of "extra
> cases" are missing: the script handles an arbitrary handful of
> ligatures but leaves many other characters aside. So I looked for a
> way to improve the generation and avoid this trouble.
>
> As you said, some characters don't have a Unicode decomposition. To
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which maps Unicode characters to
> their ASCII-range equivalents. This approach seems much more elegant:
> it avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).
>
> So, I modified the script: the command line arguments are now used to
> pass the file path of the transliterator (available as an XML file in
> the Unicode Common Locale Data Repository). You will find attached the
> new script and, for convenience, the generated output. I will also
> propose a patch for the Commitfest. Note that the script now takes (at
> most) two input files: UnicodeData.txt and (optionally) the XML file
> of the transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> with several surface changes. There is now very light support for
> command line arguments, with help messages. Previously, the text file
> was passed to the script on standard input; that approach is not
> appropriate when two files must be used. So, as I mentioned, the
> command line arguments are now used to pass the paths.
>
> Finally, the use of this transliterator inevitably increases the number
> of characters handled (1044 characters are now handled). I do not think
> that is a problem, quite the contrary: after several tests on index
> generation, I see no significant performance difference. Nonetheless,
> using the transliterator remains optional, and a command line option is
> available to disable it (so one can easily generate a small rules file,
> if desired). It seemed logical to me, however, to keep it on by default:
> that is, a priori, the desired behavior.
>
> Léonard Benedetti
Here is the patch, attached.

Léonard Benedetti
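
Editor's illustration (not part of the original message, and not the attached script): the decomposition idea discussed in the quoted text, stripping marks by general category while keeping every letter of a ligature, can be sketched with Python's standard `unicodedata` module. This is only a rough approximation under stated assumptions: it uses NFKD normalization instead of reading UnicodeData.txt directly, so it also rewrites other compatibility characters that a rules generator might want to leave alone, and the function name `strip_marks` is made up for the example.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks and expand compatibility ligatures.

    Assumption for this sketch: NFKD normalization stands in for the
    decomposition data that a generator would read from UnicodeData.txt.
    """
    # NFKD applies canonical and compatibility decompositions recursively,
    # so a ligature whose parts carry marks is fully taken apart.
    decomposed = unicodedata.normalize("NFKD", text)
    # General categories starting with "M" (Mn, Mc, Me) are marks.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

if __name__ == "__main__":
    for sample in ["é", "ﬄ", "Ǆ", "ǅ", "ǆ", "œ", "ß"]:
        print(sample, "->", strip_marks(sample))
```

Characters such as œ and ß come back unchanged from this sketch because they carry no decomposition, which is the gap the thread proposes to close with special cases or with the Latin-ASCII transliterator.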

Attachments

improve-unaccent-default-rules-generation-script.patch
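
Editor's illustration (not taken from the attached patch): for readers without the attachment at hand, here is a rough sketch of how transliterator rules might be turned into rules-file lines. It assumes a deliberately simplified rule shape, `X → Y ;` with a single source character, and a small inline sample written for this example rather than copied from CLDR; the real Latin-ASCII file has more syntax than this handles. The names `RULE`, `SAMPLE`, `parse_rules` and `to_rules_lines` are hypothetical.

```python
import re

# Assumed, simplified rule shape: "<one character> → <replacement> ;"
RULE = re.compile(r"^(.) \u2192 (.+?) ;")

# Hypothetical excerpt written for this example, not copied from CLDR.
SAMPLE = """\
# comment lines and blank lines are skipped

Æ → AE ;
œ → oe ;
ß → ss ;
"""

def parse_rules(text):
    """Map each source character to its ASCII replacement."""
    rules = {}
    for line in text.splitlines():
        match = RULE.match(line.strip())
        if match:
            rules[match.group(1)] = match.group(2)
    return rules

def to_rules_lines(rules):
    """Render one 'source<TAB>replacement' line per rule, sorted by code point."""
    return [f"{src}\t{dst}" for src, dst in sorted(rules.items())]

if __name__ == "__main__":
    for line in to_rules_lines(parse_rules(SAMPLE)):
        print(line)
```

Each output line is a whitespace-separated pair: the source character followed by its replacement.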
