Re: BUG #13440: unaccent does not remove all diacritics
From | Léonard Benedetti
---|---
Subject | Re: BUG #13440: unaccent does not remove all diacritics
Date |
Msg-id | 56BCF7A5.6020204@mlpo.fr
Response to | Re: BUG #13440: unaccent does not remove all diacritics (Teodor Sigaev <teodor@sigaev.ru>)
Responses | Re: BUG #13440: unaccent does not remove all diacritics
List | pgsql-bugs
TL;DR: Special cases that were not handled by the new script have been added. All characters previously handled by unaccent are now handled by the script, as well as new ones.

On 26/01/2016 00:44, Thomas Munro wrote:
> Wow. It would indeed be nice to use this dataset rather than
> maintaining the special cases for œ et al. It would also be nice to pick
> up all those other things like ©, ½, …, ≪, ≫ (though these stray a
> little bit further from the functionality implied by unaccent's name).

It is true that the file grows in size and offers more and more characters. But as Alvaro Herrera said in a previous mail: “To me, conceptually what unaccent does is turn whatever junk you have into a very basic common alphabet (ascii); then it's very easy to do full text searches without having to worry about what accents the people did or did not use in their searches.” and I think that makes sense. And since there is no significant performance difference, I think we can continue in this direction.

> I don't think this alone will completely get rid of the hardcoded
> special cases though, because we have these two mappings which look
> like Latin but are in fact Cyrillic and I assume we need to keep them:
>
> Ё Е
> ё е

Regarding the Cyrillic characters mentioned, I had not noticed them. But yes, we have to keep them (see Teodor Sigaev's message below). Furthermore, I continued my research to see which characters were not handled yet; there are potentially many of them, and it is not always clear whether they should be. In particular, I found several characters in the “Letterlike Symbols” Unicode block (U+2100 to U+214F) that were absent from the transliterator (℃, ℉, etc.). So I changed the script to handle special cases, and I added those I just mentioned (you will find attached the new version of the script and the generated output, for convenience).

> Should we extend the composition data analysis to make these remaining
> special cases go away? We'd need a definition of is_plain_letter that
> returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
> Depending on how you do that, you could sweep in some more Cyrillic
> mappings and a ton of stuff from other scripts that have precomposed
> diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
> someone with knowledge of relevant languages to sign off on the result
> -- so it might make sense to stick to a definition that includes just
> Latin and Cyrillic for now.
>
> (Otherwise it might be tempting to use *only* the transliterator
> approach, but CLDR doesn't seem to have appropriate transliterator
> files for other scripts. They have for example Cyrillic -> Latin, but
> we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
> ASCII.)

Indeed, I added some special cases, but I doubt very much that the list is exhaustive. It would be good to find a way to avoid these cases. Regarding the various solutions proposed, it may be possible to opt for a hybrid one: for example, extend the analysis of the composition data for blocks where relevant (some characters mentioned above show that some are missing from the transliterators), or use a transliterator when it is more convenient (perhaps for Cyrillic, etc.). You are also right that for some languages we would need someone with knowledge of those languages; this is also true for some blocks, for which we must decide whether including certain characters makes sense or not. I am thinking, notably, of the extended Latin blocks (Latin Extended-A, B, Additional, C, D, etc.), which are currently ignored.

On 11/02/2016 16:36, Teodor Sigaev wrote:
>> I don't think this alone will completely get rid of the hardcoded
>> special cases though, because we have these two mappings which look
>> like Latin but are in fact Cyrillic and I assume we need to keep them:
>>
>> Ё Е
>> ё е
>>
> As a native Russian speaker I can explain why we need to keep these two
> rules.
>
> The letter 'Ё' is not 'Е' with some accent/diacritic sign; it is a
> separate letter in the Russian alphabet. But a lot of newspapers,
> magazines and even books use 'Е' instead of 'Ё' to simplify printing
> house work. Russian speakers don't make mistakes while reading, because
> 'Ё' isn't frequent and everybody remembers the right pronunciation.
> Also, on the Russian keyboard 'Ё' is placed in an inconvenient spot
> (the key with ` or ~), so many Russian writers use 'Е' instead of it to
> increase typing speed.
>
> Please do not remove at least this special case.

This case is now handled as a special case in the new version (see above).

Léonard Benedetti
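As an aside for readers following this thread: the decomposition facts discussed above can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch of the underlying Unicode data, not the generation script from the patch; the `strip_accents` helper is a hypothetical example, not how unaccent itself works.

```python
import unicodedata

# Ё (U+0401) canonically decomposes to Cyrillic Е (U+0415) plus the
# combining diaeresis (U+0308) -- the pair Thomas mentions that
# is_plain_letter would need to recognise.
print(unicodedata.decomposition('\u0401'))   # '0415 0308'

# ℃ (U+2103, "Letterlike Symbols" block) has only a *compatibility*
# decomposition, to the degree sign followed by Latin C.
print(unicodedata.decomposition('\u2103'))   # '<compat> 00B0 0043'

# œ (U+0153) has no decomposition at all, which is why it needs either a
# hardcoded rule or the CLDR Latin-ASCII transliterator.
print(unicodedata.decomposition('\u0153'))   # ''

# A naive NFD-based stripper (hypothetical helper, for illustration):
# it handles precomposed Latin letters, but silently leaves œ untouched.
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

print(strip_accents('Ångström'))   # 'Angstrom'
print(strip_accents('œuvre'))      # 'œuvre' (unchanged)
```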
Attachments