Re: BUG #13440: unaccent does not remove all diacritics

From Emre Hasegeli
Subject Re: BUG #13440: unaccent does not remove all diacritics
Date
Msg-id CAE2gYzxRa6wWWL1NS2e8+sjzdNKRu5tMs-AGMdo2wcmq6RfTDg@mail.gmail.com
In response to Re: BUG #13440: unaccent does not remove all diacritics  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-bugs
> To me, conceptually what unaccent does is turn whatever junk you have
> into a very basic common alphabet (ascii); then it's very easy to do
> full text searches without having to worry about what accents the people
> did or did not use in their searches.  If we say "okay, but that funny
> char is not an accent so let's leave it alone" then the charter doesn't
> sound so useful to me.

It is the same for me.  It is unfortunate that this module is named
"unaccent".  There are many characters in the rules file that have
nothing to do with accents.  They are normal letters in some alphabets
that are not in ASCII.  "replace-with-ascii" would be a better name
for it.
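
To illustrate the distinction, here is a minimal Python sketch (not part of
the module; `strip_marks` is a hypothetical helper) that removes only true
combining accents using Unicode decomposition.  Letters such as "ß" or "ø"
have no decomposition, so a pure accent stripper leaves them untouched,
which is why the rules file has to list them explicitly.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks only; this is 'unaccent' in the literal sense."""
    # NFKD splits a precomposed letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for sample in ["é", "ü", "ß", "ø"]:
    # "é" and "ü" lose their marks; "ß" and "ø" come back unchanged.
    print(sample, "->", strip_marks(sample))
```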

> The cases I care about are okay anyway, because all the funny chars in
> spanish are already covered; and maybe German people always enter their
> queries using the funny ss thing I can't even write, and then this is
> not a problem for them.

I have been learning German for only a few months, and even I can confirm
that replacing "ß" with "s", or "ü" with "u", is wrong.  On the other
hand, if they were correctly replaced with "ss" and "ue", I would
be really unhappy because it is just too common in Turkish to press
"u" instead of "ü".

I think it is better for this module to replace those characters with
a single ASCII character that sounds similar.  From this point of
view, I think it is fine to replace "ß" with "s" even though it is obviously
wrong.  This module will never be useful for German without breaking
other usages, anyway.  We can try to cover as many characters as
possible while keeping this in mind.

It would also be nice to support other rule sets for real "unaccent" and
for correct German replacement.  Maybe we can add different rule
files to this module.
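
A rough sketch of what separate rule sets could look like, written in Python
purely for illustration.  The two tables and the `apply_rules` helper are
hypothetical; they are not the contents of any real rules file and only show
how the same input would come out differently under each policy.

```python
# Hypothetical "single similar ASCII character" rules, as argued for above.
ASCII_FOLD = {"ß": "s", "ü": "u", "ö": "o", "ä": "a"}

# Hypothetical German-specific rules using the conventional transliterations.
GERMAN = {"ß": "ss", "ü": "ue", "ö": "oe", "ä": "ae"}


def apply_rules(text, rules):
    """Replace each character found in `rules`; leave all others as they are."""
    return "".join(rules.get(ch, ch) for ch in text)


for word in ["Straße", "Müller"]:
    print(word, "->", apply_rules(word, ASCII_FOLD), "/", apply_rules(word, GERMAN))
```

With the first table a Turkish user who types "u" for "ü" still gets a match,
while the second table gives the spelling a German reader would expect; one
table cannot do both.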

> Regarding back-patching unaccent.rules changes as discussed downthread,
> I think it's okay to simply document that any indexes using the module
> should be reindexed immediately after upgrading to that minor version.
> The consequence of not doing so is not *that* serious anyway.  But then,
> since I'm not actually affected in any way, I'm not strongly holding
> this position either.

I think it would cause more trouble than it would help if we ever
back-patched changes to these rules.
