Re: BUG #13440: unaccent does not remove all diacritics

From Thomas Munro
Subject Re: BUG #13440: unaccent does not remove all diacritics
Msg-id CAEepm=24OUabfGQM4JLnNCtwu_fvhuH34ObKTbnsEj5oKLWL8w@mail.gmail.com
In reply to Re: BUG #13440: unaccent does not remove all diacritics  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #13440: unaccent does not remove all diacritics  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: BUG #13440: unaccent does not remove all diacritics  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: BUG #13440: unaccent does not remove all diacritics  (Léonard Benedetti <benedetti@mlpo.fr>)
List pgsql-bugs
On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm really dubious that we should be translating those ligatures at
>>> all (since the standard file is only advertised to do "unaccenting"),
>>> and if we do translate them, shouldn't they convert to AE, ae, etc?
>
>> Perhaps these conversions are intended only for comparisons, full text
>> indexing etc but not showing the converted text to a user, in which
>> case it doesn't matter too much if the conversions are a bit weird
>> (œuf and oeuf are interchangeable in French, but euf is nonsense).
>> But can we actually change them?  That could cause difficulty for
>> users with existing unaccented data stored/indexed...  But I suppose
>> even adding new mappings could cause problems.
>
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched.  Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.
>
>> Right, that does seem a little bit weak.  Instead of making
>> assumptions about the format of those names, we could make use of the
>> precomposed -> composed character mappings in the file.  We could look
>> for characters in the "letters" category where there is decomposition
>> information (ie combining characters for the individual accents) and
>> the base character is [a-zA-Z].  See attached.  This produces 411
>> mappings (including the 14 extras).  I didn't spend the time to figure
>> out which 300 odd characters were dropped but I noticed that our
>> Romanian characters of interest are definitely in.
>
> I took a quick look at this list and it seems fairly sane as far as the
> automatically-generated items go, except that I see it hits a few
> LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
> I'm still quite dubious that that is appropriate; at least, if we do it
> I think we should be expanding out to the equivalent multi-letter form,
> not simply taking one of the letters and dropping the rest.  Anybody else
> have an opinion on how to handle ligatures?

Here is a version that optionally expands ligatures if asked to with
--expand-ligatures.  It uses the Unicode 'general category' data to
identify and strip diacritical marks and distinguish them from
ligatures which are expanded to all their parts.  It meant I had to
load a bunch of stuff into memory up front, but this approach can
handle an awkward bunch of ligatures whose component characters have
marks: DŽ, Dž, dž -> DZ, Dz, dz.  (These are considered to be single
characters to maintain a one-to-one mapping with certain Cyrillic
characters in some Balkan countries which use or used both scripts.)
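To make the idea concrete, here is a minimal sketch of the mark-stripping and ligature-expansion logic described above. It is not the attached script: instead of loading the Unicode data file, it leans on Python's standard unicodedata module, and the function name strip_and_expand and its expand_ligatures parameter are made up for illustration. When ligature expansion is off, this sketch simply skips ligatures, which is an assumption about the desired behaviour.

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def is_plain_letter(ch):
    # True only for the ASCII letters a-z and A-Z.
    return ch.isascii() and ch.isalpha()


def strip_and_expand(ch, expand_ligatures=True):
    """Return an ASCII replacement for one character, or None if no rule applies.

    Hypothetical helper: a character qualifies if it is in a letter
    category and fully decomposes into ASCII letters plus combining marks.
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    # NFKD applies canonical and compatibility decompositions recursively,
    # so a ligature whose parts carry marks is broken down in one step.
    decomposed = unicodedata.normalize("NFKD", ch)
    if decomposed == ch:
        return None  # no decomposition data (for example, oe and ae ligatures)
    letters = [c for c in decomposed if not is_mark(c)]
    if not letters or not all(is_plain_letter(c) for c in letters):
        return None
    if len(letters) == 1:
        return letters[0]  # an accented letter: keep the base letter only
    # More than one base letter means a ligature.
    return "".join(letters) if expand_ligatures else None


if __name__ == "__main__":
    for ch in ["\u00e9", "\u01c4", "\u01c5", "\u01c6", "\ufb04", "\u0153"]:
        print(repr(ch), "->", strip_and_expand(ch))
```

Characters for which the function returns None are the ones that would still need hand-written rules.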

As for whether we *should* expand ligatures, I'm pretty sure that's
what I'd always want, but my only direct experience of languages with
ligatures as part of the orthography (rather than ligatures as a
typesetting artefact like ffl et al) is French, where œ is used in the
official spelling of a bunch of words like œil, sœur, cœur, œuvre when
they appear in books, but substituting oe is acceptable on computers
because neither the standard French keyboard nor the historically
important Latin1 character set includes the character.  I'm fairly
sure the Dutch have a similar situation with Ĳ: it's completely
interchangeable with the sequence IJ.

So +1 from me for ligature expansion.  It might be tempting to think
that a function called 'unaccent' should only remove diacritical
marks, but if we are going to be pedantic about it, not all
diacritical marks are actually accents anyway...

> The manually added special cases don't look any saner than they did
> before :-(.  Anybody have an objection to removing those (except maybe
> dotless i) in HEAD?

+1 from me for getting rid of the bogus œ -> e, Ĳ -> I, ... transformations, but:

1.  For some reason œ, æ (and uppercase equivalents) don't have
combining character data in the Unicode file, so they still need to be
treated as special cases if we're going to include ligatures.  Their
expansion should of course be oe and ae rather than what we have.
2.  Likewise ß still needs special treatment (it may be historically
composed of sz but Unicode doesn't know that, it's its own character
now and expands to ss anyway).
3.  I don't see any reason to drop the Afrikaans ʼn, though it should
surely be expanded to 'n rather than n.
4.  I have no clue about whether the single Cyrillic item in there
belongs there.
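The hand-maintained special cases from points 1 to 3 could be sketched as a small table that is consulted alongside the generated mappings. This is only an illustration: the names SPECIAL_CASES and apply_rules are hypothetical, and the Cyrillic item from point 4 is deliberately left out.

```python
import unicodedata

# Hand-written expansions for characters discussed in points 1-3.
# The table name and contents are illustrative, not the shipped rules file.
SPECIAL_CASES = {
    "\u00c6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
    "\u0149": "'n",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def has_decomposition(ch):
    # unicodedata.decomposition returns an empty string when the
    # Unicode data carries no decomposition mapping for the character.
    return unicodedata.decomposition(ch) != ""


def apply_rules(text, rules=SPECIAL_CASES):
    # Replace each character that has a rule; leave everything else alone.
    return "".join(rules.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for ch in ["\u0153", "\u00e6", "\u00df"]:
        print(repr(ch), "has decomposition:", has_decomposition(ch))
    print(apply_rules("c\u0153ur"))
```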

Just by the way, there are conventional rules for diacritic removal in
some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
Scandinavian languages and è -> e' in Italian.  A German friend of
mine has a ü in his last name and he finishes up with any of three
possible spellings of his name on various official documents, credit
cards etc as a result!  But these sorts of things are specific to
individual languages and don't belong in a general accent removal rule
file (it would be inappropriate to convert French aigüe to aiguee or
Spanish pingüino to pingueino).  I guess speakers of those languages
could consider submitting rules files for language-specific
conventions like that.
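A rough sketch of why such conventions need to be separate, per-language rule sets layered over the general behaviour: the names GERMAN_RULES and unaccent_with are invented for this example, only the lowercase German vowels are covered, and the general fallback just strips combining marks using Python's unicodedata module.

```python
import unicodedata

# Illustrative language-specific overrides (lowercase only).
GERMAN_RULES = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}


def strip_marks(text):
    # General rule: decompose, then drop combining marks (categories M*).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))


def unaccent_with(text, language_rules=None):
    # Normalize to NFC first so precomposed keys in the rule table match.
    rules = language_rules or {}
    text = unicodedata.normalize("NFC", text)
    return "".join(rules[c] if c in rules else strip_marks(c) for c in text)


if __name__ == "__main__":
    print(unaccent_with("M\u00fcller", GERMAN_RULES))    # German convention
    print(unaccent_with("ping\u00fcino"))                # general rule
    print(unaccent_with("ping\u00fcino", GERMAN_RULES))  # wrong for Spanish
```

The last call shows the problem: applying the German table to a Spanish word produces pingueino, which is why these rules should stay out of the general file.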

--
Thomas Munro
http://www.enterprisedb.com
