Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Msg-id: CAAhbUMNqJXTN+_vYdi5L4CLjoq9OCG29V597RKrCQ7xKsCAejA@mail.gmail.com
In response to: Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Hugh Ranalli <hugh@whtc.ca>)
Responses: Re: BUG #15548: Unaccent does not remove combining diacritical characters  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs

I've attached a patch that removes combining diacriticals. As with Latin and Greek letters, it uses ranges to restrict its activity.
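
For illustration only (this is not the attached patch), range-restricted removal can be sketched in a few lines of Python. The names COMBINING_RANGES, is_combining_diacritical and strip_combining are hypothetical, and the single range used here, the Unicode "Combining Diacritical Marks" block U+0300 to U+036F, is an assumption; the ranges in the patch may differ.

```python
import unicodedata

# Assumption: only the "Combining Diacritical Marks" block is covered.
# The ranges used by the actual patch may be different.
COMBINING_RANGES = ((0x0300, 0x036F),)


def is_combining_diacritical(ch):
    """Return True if the character's code point lies in a configured range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)


def strip_combining(text):
    """Drop every character that falls inside one of the ranges."""
    return "".join(ch for ch in text if not is_combining_diacritical(ch))


# "é" in decomposed form is "e" followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", "\u00e9")
plain = strip_combining(decomposed)
```

Characters outside the listed ranges are left alone, which is the point of restricting by range: combining marks from other scripts are not touched unless a range for them is added.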

I have not submitted a patch for unaccent.rules, as it seems that a rules file generated by generate_unaccent_rules.py would actually drop a large number of the shipped rules (even before my changes), such as the one replacing the copyright symbol © with (C), as well as rules for other accented characters. It's probably worth asking whether the shipped unaccent.rules should correspond to what the shipped generation utility produces. I was surprised to see that it didn't.
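
One way to check which shipped rules a regenerated file would lose is a plain set difference over the rule lines. The sketch below works on in-memory strings; the helper names rule_set and dropped_rules and the two sample inputs are made up for illustration and are not taken from the real files.

```python
def rule_set(rules_text):
    """Turn rules text into a set of tuples, one tuple per non-empty line."""
    return {
        tuple(line.split())
        for line in rules_text.splitlines()
        if line.strip()
    }


def dropped_rules(shipped_text, generated_text):
    """Rules present in the shipped text but absent from the generated text."""
    return sorted(rule_set(shipped_text) - rule_set(generated_text))


# Hypothetical sample data, not the contents of the real unaccent.rules.
shipped_sample = "\u00a9\t(C)\n\u00e9\te\n"
generated_sample = "\u00e9\te\n"
missing = dropped_rules(shipped_sample, generated_sample)
```

With real files, the two strings would be read from the shipped unaccent.rules and from the generator's output.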

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957
w: www.whtc.ca


On Thu, 13 Dec 2018 at 13:50, Hugh Ranalli <hugh@whtc.ca> wrote:


On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org> wrote:
Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the characters, simply dropping it should result in the correct unaccented character.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

 "Alternatively, if only one character is given on a line, instances
 of that character are deleted; this is useful in languages where
 accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to take. I'll go ahead and develop a patch, then.
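
The rule format quoted above can be imitated with a small Python sketch. This is only an emulation for illustration, not how the unaccent module is implemented: parse_rules and apply_rules are hypothetical helpers, the sample rules are invented, and the sketch assumes every rule source is a single character.

```python
def parse_rules(rules_text):
    """Parse rules text: 'source target' replaces, a lone 'source' deletes."""
    rules = {}
    for line in rules_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        source = fields[0]
        # A line with only one field maps the character to the empty string.
        target = fields[1] if len(fields) > 1 else ""
        rules[source] = target
    return rules


def apply_rules(text, rules):
    """Apply the rules character by character (single-character sources only)."""
    return "".join(rules.get(ch, ch) for ch in text)


# Hypothetical sample: one replacement rule and one deletion rule.
SAMPLE_RULES = "\u00e9\te\n\u0301\n"
rules = parse_rules(SAMPLE_RULES)
precomposed_result = apply_rules("caf\u00e9", rules)
decomposed_result = apply_rules("cafe\u0301", rules)
```

Both spellings of the word end up as the same unaccented string: the precomposed é is handled by the replacement rule, and the decomposed form is handled by deleting the combining accent on its own line.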

Best wishes,
Hugh