Re: BUG #13440: unaccent does not remove all diacritics
From | Léonard Benedetti |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | 56A4495C.8020705@mlpo.fr |
In response to | Re: BUG #13440: unaccent does not remove all diacritics (Léonard Benedetti <benedetti@mlpo.fr>) |
List | pgsql-bugs |
On 24/01/2016 04:18, Léonard Benedetti wrote:
> On 19/06/2015 04:00, Thomas Munro wrote:
>> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I took a quick look at this list and it seems fairly sane as far as
>>> the automatically-generated items go, except that I see it hits a few
>>> LIGATURE cases (including the existing ij cases, but also fi fl and
>>> ffl). I'm still quite dubious that that is appropriate; at least, if
>>> we do it I think we should be expanding out to the equivalent
>>> multi-letter form, not simply taking one of the letters and dropping
>>> the rest. Anybody else have an opinion on how to handle ligatures?
>> Here is a version that optionally expands ligatures if asked to with
>> --expand-ligatures. It uses the Unicode 'general category' data to
>> identify and strip diacritical marks and distinguish them from
>> ligatures which are expanded to all their parts. It meant I had to
>> load a bunch of stuff into memory up front, but this approach can
>> handle an awkward bunch of ligatures whose component characters have
>> marks: DŽ, Dž, dž -> DZ, Dz, dz. (These are considered to be single
>> characters to maintain a one-to-one mapping with certain Cyrillic
>> characters in some Balkan countries which use or used both scripts.)
>>
>> As for whether we *should* expand ligatures, I'm pretty sure that's
>> what I'd always want, but my only direct experience of languages with
>> ligatures as part of the orthography (rather than ligatures as a
>> typesetting artefact like ffl et al) is French, where œ is used in the
>> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
>> they appear in books, but substituting oe is acceptable on computers
>> because neither the standard French keyboard nor the historically
>> important Latin1 character set includes the character. I'm fairly
>> sure the Dutch have a similar situation with Ĳ, it's completely
>> interchangeable with the sequence IJ.
>>
>> So +1 from me for ligature expansion. It might be tempting to think
>> that a function called 'unaccent' should only remove diacritical
>> marks, but if we are going to be pedantic about it, not all
>> diacritical marks are actually accents anyway...
>>
>>> The manually added special cases don't look any saner than they did
>>> before :-(. Anybody have an objection to removing those (except maybe
>>> dotless i) in HEAD?
>> +1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:
>>
>> 1. For some reason œ, æ (and uppercase equivalents) don't have
>> combining character data in the Unicode file, so they still need to be
>> treated as special cases if we're going to include ligatures. Their
>> expansion should of course be oe and ae rather than what we have.
>> 2. Likewise ß still needs special treatment (it may be historically
>> composed of sz but Unicode doesn't know that, it's its own character
>> now and expands to ss anyway).
>> 3. I don't see any reason to drop the Afrikaans ʼn, though it should
>> surely be expanded to 'n rather than n.
>> 4. I have no clue about whether the single Cyrillic item in there
>> belongs there.
>>
>> Just by the way, there are conventional rules for diacritic removal in
>> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
>> Scandinavian languages and è -> e' in Italian. A German friend of
>> mine has a ü in his last name and he finishes up with any of three
>> possible spellings of his name on various official documents, credit
>> cards etc as a result! But these sorts of things are specific to
>> individual languages and don't belong in a general accent removal rule
>> file (it would be inappropriate to convert French aigüe to aiguee or
>> Spanish pingüino to pingueino). I guess speakers of those languages
>> could consider submitting rules files for language-specific
>> conventions like that.
>>
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem to me is that it lacks a number of "extra
> cases". In fact, the script handles a few arbitrary ligatures but leaves
> many things aside. So I looked for a way to improve the generation, to
> avoid having this trouble.
>
> As you said, some characters don't have a Unicode decomposition. So, to
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which maps Unicode characters to their
> ASCII-range equivalents. This approach seems much more elegant: it
> avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).
>
> So, I modified the script: the arguments of the command line are used to
> pass the file path of the transliterator (available as an XML file in
> the Unicode Common Locale Data Repository). Attached you will find the
> new script and the generated output for convenience; I will also propose
> a patch for the Commitfest. Note that the script now takes (at most) two
> input files: UnicodeData.txt and (optionally) the XML file of the
> transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> through several surface changes. There is now very light support for
> command line arguments, with help messages. Previously, the text file
> was passed to the script on standard input; this approach is not
> appropriate when two files must be used. So as I mentioned, the
> arguments of the command line are now used to pass the paths.
>
> Finally, the use of this transliterator inevitably increases the number
> of characters handled. I do not think that is a problem (1044
> characters are handled), on the contrary, and after several tests on
> index generation, I saw no significant performance difference.
> Nonetheless, using the transliterator remains optional and a command
> line option is available to disable it (so one can easily generate a
> small rules file, if desired). It seemed however logical to me to keep
> it on by default: that is, a priori, the desired behavior.
>
> Léonard Benedetti

Here is the patch, attached.

Léonard Benedetti
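
Illustrative sketch (not part of the message or of the attached patch): the generation approach discussed in this thread, which strips combining marks identified by their Unicode general category, expands ligatures to all their parts, and keeps a small table for characters that have no decomposition, could look roughly like the following. It rests on several assumptions: Python's unicodedata.normalize("NFKD", ...) is swapped in for parsing UnicodeData.txt, the names SPECIAL_CASES, unaccent_char and make_rules are hypothetical, and the hand-written table only stands in for what the CLDR Latin-ASCII transliterator would supply.

```python
import unicodedata

# Assumption: a tiny hand-written table for characters that have no Unicode
# decomposition (mentioned in the thread). It is only a stand-in for the
# mappings a transliterator such as CLDR Latin-ASCII would provide.
SPECIAL_CASES = {
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u00c6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
}


def unaccent_char(ch, expand_ligatures=True):
    """Return a mark-free replacement for one character (hypothetical helper)."""
    if ch in SPECIAL_CASES:
        return SPECIAL_CASES[ch]
    # NFKD applies canonical and compatibility decompositions recursively.
    # Note that it also rewrites non-ligature compatibility characters
    # (superscripts, fullwidth forms, ...), which a real generator might filter.
    decomposed = unicodedata.normalize("NFKD", ch)
    # General categories Mn, Mc and Me are the combining marks; drop them.
    letters = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    if not letters:
        return ch  # nothing left (e.g. a lone combining mark): leave unchanged
    if len(letters) > 1 and not expand_ligatures:
        return letters[0]  # old behaviour: keep only the first component
    return letters


def make_rules(chars, expand_ligatures=True):
    """Collect (source, target) pairs for characters that actually change."""
    rules = []
    for ch in chars:
        target = unaccent_char(ch, expand_ligatures)
        if target != ch:
            rules.append((ch, target))
    return rules


if __name__ == "__main__":
    # e-acute, u-diaeresis, fi ligature, ij ligature, dz-caron digraph, oe, sharp s
    sample = "\u00e9\u00fc\ufb01\u0133\u01c6\u0153\u00df"
    for src, dst in make_rules(sample):
        # One whitespace-separated pair per line, in the spirit of a rules file.
        print(f"{src}\t{dst}")
```

The digraph case from the thread works here because NFKD first splits the single character into its two letters and then decomposes the caron, which the category filter removes. Passing expand_ligatures=False reproduces the "keep one letter, drop the rest" behaviour that the thread argues against.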