Re: BUG #13440: unaccent does not remove all diacritics

From Tom Lane
Subject Re: BUG #13440: unaccent does not remove all diacritics
Date
Msg-id 1790.1434492074@sss.pgh.pa.us
In reply to Re: BUG #13440: unaccent does not remove all diacritics  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: BUG #13440: unaccent does not remove all diacritics  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-bugs
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> It looks like Romanian also has s with comma.  Perhaps we should have
>> all these characters:
>>
>> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
>> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>> 702

> Here is an unaccent.rules file that maps those 702 characters from
> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
> ..." to their base letter, plus 14 extra cases to match the existing
> unaccent.rules file.  If you sort and diff this and the existing file,
> you can see that this file only adds new lines.  Also, here is the
> script I used to build it from UnicodeData.txt.

Hm.  The "extra cases" are pretty disturbing, because some of them sure
look like bugs; which makes me wonder how closely the unaccent.rules
file was vetted to begin with.  For those following along at home,
here are Thomas' extra cases, annotated by me with the Unicode file's
description of each source character:

    print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
    print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
    print_record(0x00e6, "a") # LATIN SMALL LETTER AE
    print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
    print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
    print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
    print_record(0x0138, "k") # LATIN SMALL LETTER KRA
    print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
    print_record(0x014b, "n") # LATIN SMALL LETTER ENG
    print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
    print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
    print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
    print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO

I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?

Also unclear why we're dealing with KRA and ENG but not any of the
other marginal letters that Unicode labels as LATIN (what the heck
is an "AFRICAN D", for instance?)

Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal.  Comments from German speakers welcome of course.
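
For anyone who wants to check the sharp-S point against the Unicode tables rather than against German intuition, here is a minimal sketch using only Python 3's standard library. The helper name `summarize` is made up for illustration; it is not part of the unaccent module or of Thomas' script.

```python
import unicodedata

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S

def summarize(ch):
    """Return how Python's Unicode tables treat a single character."""
    return {
        "upper": ch.upper(),                             # full uppercase mapping
        "casefold": ch.casefold(),                       # caseless-matching form
        "nfd": unicodedata.normalize("NFD", ch),         # canonical decomposition
        "decomposition": unicodedata.decomposition(ch),  # raw decomposition field
    }

info = summarize(SHARP_S)
print(info)
```

The case operations are the only ones that change the character ("SS" for upper, "ss" for casefold); the decomposition field is empty and NFD leaves it alone, which fits the view that this is a case-related mapping and not accent removal.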

Likewise dubious about those Cyrillic entries, although I suppose
Teodor probably had good reasons for including them.
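
One way to look at the fourteen extra cases together is to ask what Unicode itself says about decomposing them. The sketch below is only an inspection aid, assuming Python 3 and its bundled `unicodedata` tables (which may be a newer Unicode version than 7.0); `describe` is a hypothetical helper, not something from the existing rules script.

```python
import unicodedata

# Code points of the 14 "extra cases" listed above.
EXTRA_CASES = [
    0x00C6, 0x00DF, 0x00E6, 0x0131, 0x0132, 0x0133, 0x0138,
    0x0149, 0x014A, 0x014B, 0x0152, 0x0153, 0x0401, 0x0451,
]

def describe(codepoint):
    """Return (character name, decomposition field) for one code point."""
    ch = chr(codepoint)
    return unicodedata.name(ch), unicodedata.decomposition(ch)

for cp in EXTRA_CASES:
    name, decomp = describe(cp)
    print("U+%04X  %-45s %s" % (cp, name, decomp or "(no decomposition)"))
```

In these tables the two Cyrillic IO letters decompose canonically into a base letter plus COMBINING DIAERESIS, the IJ ligatures and N PRECEDED BY APOSTROPHE carry only `<compat>` decompositions, and AE, OE, sharp S, dotless i, KRA and ENG have none at all. That does not settle what the rules file should do, but it shows the list mixes at least three different kinds of character.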

On the other side of the coin, I think Thomas' regex might have swept up a
bit too much.  I did this to see what sort of decorations were described:

$ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
  34 ACUTE
   2 ACUTE AND DOT ABOVE
   4 BAR
   2 BELT
  12 BREVE
   2 BREVE AND ACUTE
   2 BREVE AND DOT BELOW
   2 BREVE AND GRAVE
   2 BREVE AND HOOK ABOVE
   2 BREVE AND TILDE
   2 BREVE BELOW
  34 CARON
   2 CARON AND DOT ABOVE
  22 CEDILLA
   2 CEDILLA AND ACUTE
   2 CEDILLA AND BREVE
  26 CIRCUMFLEX
   6 CIRCUMFLEX AND ACUTE
   6 CIRCUMFLEX AND DOT BELOW
   6 CIRCUMFLEX AND GRAVE
   6 CIRCUMFLEX AND HOOK ABOVE
   6 CIRCUMFLEX AND TILDE
  12 CIRCUMFLEX BELOW
   4 COMMA BELOW
   4 CROSSED-TAIL
   7 CURL
   8 DESCENDER
  19 DIAERESIS
   4 DIAERESIS AND ACUTE
   2 DIAERESIS AND CARON
   2 DIAERESIS AND GRAVE
   6 DIAERESIS AND MACRON
   2 DIAERESIS BELOW
   8 DIAGONAL STROKE
  39 DOT ABOVE
   4 DOT ABOVE AND MACRON
  38 DOT BELOW
   2 DOT BELOW AND DOT ABOVE
   4 DOT BELOW AND MACRON
   4 DOUBLE ACUTE
   2 DOUBLE BAR
  12 DOUBLE GRAVE
   1 DOUBLE MIDDLE TILDE
   1 FISHHOOK
   1 FISHHOOK AND MIDDLE TILDE
   5 FLOURISH
  16 GRAVE
   2 HIGH STROKE
  30 HOOK
  12 HOOK ABOVE
   1 HOOK AND TAIL
   1 HOOK TAIL
   4 HORN
   4 HORN AND ACUTE
   4 HORN AND DOT BELOW
   4 HORN AND GRAVE
   4 HORN AND HOOK ABOVE
   4 HORN AND TILDE
  12 INVERTED BREVE
   1 INVERTED LAZY S
   3 LEFT HOOK
  17 LINE BELOW
   1 LONG LEFT LEG
   1 LONG LEFT LEG AND LOW RIGHT RING
   1 LONG LEG
   2 LONG RIGHT LEG
   2 LONG STROKE OVERLAY
   4 LOOP
   1 LOW RIGHT RING
   1 LOW RING INSIDE
  14 MACRON
   4 MACRON AND ACUTE
   2 MACRON AND DIAERESIS
   4 MACRON AND GRAVE
   2 MIDDLE DOT
   1 MIDDLE RING
  13 MIDDLE TILDE
   1 NOTCH
  10 OBLIQUE STROKE
  10 OGONEK
   2 OGONEK AND MACRON
  17 PALATAL HOOK
   9 RETROFLEX HOOK
   1 RETROFLEX HOOK AND BELT
   1 RIGHT HALF RING
   1 RIGHT HOOK
   6 RING ABOVE
   2 RING ABOVE AND ACUTE
   2 RING BELOW
   1 SERIF
   2 SHORT RIGHT LEG
   2 SMALL LETTER J
   1 SMALL LETTER Z
   2 SQUIRREL TAIL
  36 STROKE
   2 STROKE AND ACUTE
   2 STROKE AND DIAGONAL STROKE
   4 STROKE THROUGH DESCENDER
   4 SWASH TAIL
   3 TAIL
  16 TILDE
   4 TILDE AND ACUTE
   2 TILDE AND DIAERESIS
   2 TILDE AND MACRON
   6 TILDE BELOW
   4 TOPBAR
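
The same tally can be approximated in Python, which may be handier if the rule generator stays a Python script. This is a sketch under stated assumptions: it looks only at the name field (the second `;`-separated field) instead of the whole line, and the `SAMPLE` lines are truncated stand-ins for real UnicodeData.txt records, used here so the block runs without the file. `tally_decorations` is a hypothetical name.

```python
import re
from collections import Counter

# Same name pattern as the egrep above, anchored at the start of the name field.
NAME_RE = re.compile(r"LATIN (?:SMALL|CAPITAL) LETTER [A-Z] WITH ")

def tally_decorations(lines):
    """Count the text after the last ' WITH ' in each matching character name."""
    counts = Counter()
    for line in lines:
        fields = line.split(";")
        if len(fields) > 1 and NAME_RE.match(fields[1]):
            counts[fields[1].rsplit(" WITH ", 1)[1]] += 1
    return counts

# Truncated sample records (code point; name; general category) for illustration.
SAMPLE = [
    "00C1;LATIN CAPITAL LETTER A WITH ACUTE;Lu",
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll",
    "00C6;LATIN CAPITAL LETTER AE;Lu",
    "0219;LATIN SMALL LETTER S WITH COMMA BELOW;Ll",
]

for decoration, count in sorted(tally_decorations(SAMPLE).items()):
    print("%4d %s" % (count, decoration))
```

To run it over the real data, pass an open file instead of `SAMPLE`, for example `tally_decorations(open("UnicodeData.txt", encoding="ascii"))`.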

Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL?  Is removing those sorts of things even
legitimately "unaccenting"?  I dunno, but I think it would be good to
have some consensus about what we want this file to do.  I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.
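
To make the "what should this file do" question concrete, here is one possible criterion that does not depend on how the character names are phrased: keep only what survives after canonical decomposition and removal of nonspacing marks. This is a sketch of an alternative for discussion, assuming Python 3's `unicodedata`; it is not how the current unaccent.rules was produced, and `strip_combining_marks` is a made-up name.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose canonically (NFD), then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for sample in ["\u00e9", "\u0219", "\u00f8", "\u0401"]:
    print("U+%04X -> %s" % (ord(sample), ascii(strip_combining_marks(sample))))
```

Under this rule E WITH ACUTE and the Romanian S WITH COMMA BELOW lose their marks, and CYRILLIC CAPITAL LETTER IO becomes U+0415, while O WITH STROKE is left untouched because Unicode gives it no decomposition. Whether letters of that last kind (and the FISHHOOK and SQUIRREL TAIL crowd) should be mapped at all is exactly the policy question above.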

            regards, tom lane
