Thread: BUG #13440: unaccent does not remove all diacritics


BUG #13440: unaccent does not remove all diacritics

From: mike@busbud.com
Date:
The following bug has been logged on the website:

Bug reference:      13440
Logged by:          Mike Gradek
Email address:      mike@busbud.com
PostgreSQL version: 9.3.5
Operating system:   Mac OS X 10.10.3 (14D136)
Description:

Sorry, I couldn't install the most recent minor release, but I did try this
on several different versions. I used Heroku to try a 9.4.3 build, and got
the same results

select 'ț' as input, unaccent('ț') as observed, 't' as expected;
 input | observed | expected
-------+----------+----------
 ț     | ț        | t
(1 row)

Let me know how I can help resolve this bug, or if it's expected.

Best regards,
Mike

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
mike@busbud.com writes:
> Sorry, I couldn't install the most recent minor release, but I did try this
> on several different versions. I used Heroku to try a 9.4.3 build, and got
> the same results

> select 'ț' as input, unaccent('ț') as observed, 't' as expected;
>  input | observed | expected
> -------+----------+----------
>  ț     | ț        | t
> (1 row)

Hm, I do see

ţ       t

in unaccent.rules, so the transformation ought to happen.  I suspect
an encoding issue, eg your terminal window is not transmitting characters
in the encoding Postgres thinks you're using.  You did not provide any
info about server encoding, client encoding, or client LC_xxx environment,
so it's hard to debug from here.
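
As an aside for anyone chasing a suspected encoding mix-up like this: the two lookalike characters have distinct UTF-8 byte sequences, so comparing raw bytes is one way to see what the terminal actually sent (a Python sketch, not part of the original exchange):

```python
# The two visually similar characters and their UTF-8 encodings.
t_comma = '\u021b'    # ț  LATIN SMALL LETTER T WITH COMMA BELOW
t_cedilla = '\u0163'  # ţ  LATIN SMALL LETTER T WITH CEDILLA

# If the client transmits 0xC8 0x9B, the server received t-comma;
# 0xC5 0xA3 would mean t-cedilla.
print(t_comma.encode('utf-8'))    # b'\xc8\x9b'
print(t_cedilla.encode('utf-8'))  # b'\xc5\xa3'
```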

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Michael Gradek
Date:
Hi Tom,

Thanks for looking into this issue. Would this help?

> psql -l

                                              List of databases
          Name          |     Owner     | Encoding |   Collate   |    Ctype    |        Access privileges
------------------------+---------------+----------+-------------+-------------+---------------------------------
 grand-central          | michaelgradek | UTF8     | en_US.UTF-8 | en_US.UTF-8 |

Here's a case showing the transformation failing, and another succeeding

> psql grand-central

psql (9.4.1, server 9.3.5)

Type "help" for help.


grand-central=# select 'ț' as input, unaccent('ț') as observed, 't' as expected;
 input | observed | expected
-------+----------+----------
 ț     | ț        | t
(1 row)

grand-central=# select 'é' as input, unaccent('é') as observed, 'e' as expected;
 input | observed | expected
-------+----------+----------
 é     | e        | e
(1 row)


On Sun, Jun 14, 2015 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> mike@busbud.com writes:
> > Sorry, I couldn't install the most recent minor release, but I did try this
> > on several different versions. I used Heroku to try a 9.4.3 build, and got
> > the same results
>
> > select 'ț' as input, unaccent('ț') as observed, 't' as expected;
> >  input | observed | expected
> > -------+----------+----------
> >  ț     | ț        | t
> > (1 row)
>
> Hm, I do see
>
> ţ       t
>
> in unaccent.rules, so the transformation ought to happen.  I suspect
> an encoding issue, eg your terminal window is not transmitting characters
> in the encoding Postgres thinks you're using.  You did not provide any
> info about server encoding, client encoding, or client LC_xxx environment,
> so it's hard to debug from here.
>
>                         regards, tom lane
>



-- 
Cheers,
Mike
-- 
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike@busbud.com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*

Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro
Date:
On Mon, Jun 15, 2015 at 5:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> mike@busbud.com writes:
>> Sorry, I couldn't install the most recent minor release, but I did try this
>> on several different versions. I used Heroku to try a 9.4.3 build, and got
>> the same results
>
>> select 'ț' as input, unaccent('ț') as observed, 't' as expected;
>>  input | observed | expected
>> -------+----------+----------
>>  ț     | ț        | t
>> (1 row)
>
> Hm, I do see
>
> ţ       t
>
> in unaccent.rules, so the transformation ought to happen.  I suspect
> an encoding issue, eg your terminal window is not transmitting characters
> in the encoding Postgres thinks you're using.  You did not provide any
> info about server encoding, client encoding, or client LC_xxx environment,
> so it's hard to debug from here.

The one that is in unaccent.rules is apparently t-cedilla, from Gagauz
and Romanian:

https://en.wiktionary.org/wiki/%C5%A3

The one that is referred to above is apparently t-comma, from Livonian
and Romanian, but "[o]ften replaced by Ţ / ţ (t with cedilla),
especially in computing":

https://en.wiktionary.org/wiki/%C8%9B

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG #13440: unaccent does not remove all diacritics

From: Alvaro Herrera
Date:
Michael Gradek wrote:

> grand-central=# select 'ț' as input, unaccent('ț') as observed, 't' as
> expected;
>
>  input | observed | expected
>
> -------+----------+----------
>
>  ț     | ț        | t

> > Hm, I do see
> >
> > ţ       t

My terminal shows these characters to be different.  One is
http://graphemica.com/%C8%9B
    latin small letter t with comma below (U+021B)

The other is
http://graphemica.com/%C5%A3
    latin small letter t with cedilla (U+0163)
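
Python's unicodedata module (a convenient stand-in for reading UnicodeData.txt directly) confirms the distinction, including the different combining mark in each character's canonical decomposition:

```python
import unicodedata

# Both characters decompose to base letter 't' (U+0074) plus a different
# combining mark: U+0326 COMBINING COMMA BELOW vs. U+0327 COMBINING CEDILLA.
print(unicodedata.name('\u021b'))           # LATIN SMALL LETTER T WITH COMMA BELOW
print(unicodedata.decomposition('\u021b'))  # 0074 0326
print(unicodedata.name('\u0163'))           # LATIN SMALL LETTER T WITH CEDILLA
print(unicodedata.decomposition('\u0163'))  # 0074 0327
```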

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> My terminal shows these characters to be different.  One is
> http://graphemica.com/%C8%9B
>     latin small letter t with comma below (U+021B)

> The other is
> http://graphemica.com/%C5%A3
>     latin small letter t with cedilla (U+0163)

Ah-hah -- I did not look closely enough.  So the immediate answer for
Michael is to add another entry to his unaccent.rules file.
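
For reference, unaccent.rules is a plain-text file of source-character and replacement pairs separated by whitespace, so the workaround amounts to adding lines like the following (the capital form is added here by analogy with the file's existing pairs):

```
ț	t
Ț	T
```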

Should we add the missing character to the standard unaccent.rules file?
I should think so in HEAD at least, but what about back-patching?

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro
Date:
On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> My terminal shows these characters to be different.  One is
>> http://graphemica.com/%C8%9B
>>       latin small letter t with comma below (U+021B)
>
>> The other is
>> http://graphemica.com/%C5%A3
>>       latin small letter t with cedilla (U+0163)
>
> Ah-hah -- I did not look closely enough.  So the immediate answer for
> Michael is to add another entry to his unaccent.rules file.
>
> Should we add the missing character to the standard unaccent.rules file?

It looks like Romanian also has s with comma.  Perhaps we should have
all these characters:

$ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
     702

That's quite a lot more than the 187 we currently have.  Of those, I
think only the following ligature characters don't fit the above
pattern: Æ, æ, IJ, ij, Œ, œ, ß.  Incidentally, I don't believe that the
way we "unaccent" ligatures is correct anyway.  Maybe they should be
expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
O, o, S as we have it, but I guess it depends what the purpose of
unaccent is...

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro
Date:
On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> My terminal shows these characters to be different.  One is
>>> http://graphemica.com/%C8%9B
>>>       latin small letter t with comma below (U+021B)
>>
>>> The other is
>>> http://graphemica.com/%C5%A3
>>>       latin small letter t with cedilla (U+0163)
>>
>> Ah-hah -- I did not look closely enough.  So the immediate answer for
>> Michael is to add another entry to his unaccent.rules file.
>>
>> Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma.  Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>      702

Here is an unaccent.rules file that maps those 702 characters from
Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
..." to their base letter, plus 14 extra cases to match the existing
unaccent.rules file.  If you sort and diff this and the existing file,
you can see that this file only adds new lines.  Also, here is the
script I used to build it from UnicodeData.txt.
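
The name-based extraction described above can be sketched as follows, run over a few abbreviated sample lines of UnicodeData.txt (the real script reads the whole file; only the first two semicolon-separated fields matter here):

```python
import re

# Abbreviated sample of UnicodeData.txt records (code point; name; ...).
SAMPLE = """\
0163;LATIN SMALL LETTER T WITH CEDILLA;Ll;0;L;0074 0327;;;;N;;;0162;;0162
021B;LATIN SMALL LETTER T WITH COMMA BELOW;Ll;0;L;0074 0326;;;;N;;;021A;;021A
00F6;LATIN SMALL LETTER O WITH DIAERESIS;Ll;0;L;006F 0308;;;;N;;;00D6;;00D6
"""

PATTERN = re.compile(r'LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ')

rules = []
for line in SAMPLE.splitlines():
    code, name = line.split(';')[:2]
    m = PATTERN.match(name)
    if m:
        case, base = m.groups()
        # Map the precomposed character to its base letter, preserving case.
        rules.append((chr(int(code, 16)), base.lower() if case == 'SMALL' else base))

for src, dst in rules:
    print(f'{src}\t{dst}')
```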

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From: Michael Gradek
Date:
Thanks everyone, I've been comparing the behavior to that of
https://github.com/andrewrk/node-diacritics/blob/master/index.js if that
can be of any help.

On Monday, June 15, 2015, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> My terminal shows these characters to be different.  One is
> >> http://graphemica.com/%C8%9B
> >>       latin small letter t with comma below (U+021B)
> >
> >> The other is
> >> http://graphemica.com/%C5%A3
> >>       latin small letter t with cedilla (U+0163)
> >
> > Ah-hah -- I did not look closely enough.  So the immediate answer for
> > Michael is to add another entry to his unaccent.rules file.
> >
> > Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma.  Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>      702
>
> That's quite a lot more than the 187 we currently have.  Of those, I
> think only the following ligature characters don't fit the above
> pattern: Æ, æ, IJ, ij, Œ, œ, ß.  Incidentally, I don't believe that the
> way we "unaccent" ligatures is correct anyway.  Maybe they should be
> expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
> O, o, S as we have it, but I guess it depends what the purpose of
> unaccent is...
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


-- 
Cheers,
Mike
-- 
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike@busbud.com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> It looks like Romanian also has s with comma.  Perhaps we should have
>> all these characters:
>>
>> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
>> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>> 702

> Here is an unaccent.rules file that maps those 702 characters from
> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
> ..." to their base letter, plus 14 extra cases to match the existing
> unaccent.rules file.  If you sort and diff this and the existing file,
> you can see that this file only adds new lines.  Also, here is the
> script I used to build it from UnicodeData.txt.

Hm.  The "extra cases" are pretty disturbing, because some of them sure
look like bugs; which makes me wonder how closely the unaccent.rules
file was vetted to begin with.  For those following along at home,
here are Thomas' extra cases, annotated by me with the Unicode file's
description of each source character:

    print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
    print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
    print_record(0x00e6, "a") # LATIN SMALL LETTER AE
    print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
    print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
    print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
    print_record(0x0138, "k") # LATIN SMALL LETTER KRA
    print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
    print_record(0x014b, "n") # LATIN SMALL LETTER ENG
    print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
    print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
    print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
    print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO

I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?

Also unclear why we're dealing with KRA and ENG but not any of the
other marginal letters that Unicode labels as LATIN (what the heck
is an "AFRICAN D", for instance?)

Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal.  Comments from German speakers welcome of course.
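
Python's built-in case mappings illustrate the same point (a side note, not part of the original thread): the standard transformations for ß are case-related, and none of them yields a single 's':

```python
# Uppercasing and case folding both expand ß to a double-s form;
# neither maps it to a single 's'.
print('ß'.upper())     # SS  (full uppercase mapping)
print('ß'.casefold())  # ss  (Unicode full case folding)
print('ß'.lower())     # ß   (already lower case, unchanged)
```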

Likewise dubious about those Cyrillic entries, although I suppose
Teodor probably had good reasons for including them.

On the other side of the coin, I think Thomas' regex might have swept up a
bit too much.  I did this to see what sort of decorations were described:

$ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
  34 ACUTE
   2 ACUTE AND DOT ABOVE
   4 BAR
   2 BELT
  12 BREVE
   2 BREVE AND ACUTE
   2 BREVE AND DOT BELOW
   2 BREVE AND GRAVE
   2 BREVE AND HOOK ABOVE
   2 BREVE AND TILDE
   2 BREVE BELOW
  34 CARON
   2 CARON AND DOT ABOVE
  22 CEDILLA
   2 CEDILLA AND ACUTE
   2 CEDILLA AND BREVE
  26 CIRCUMFLEX
   6 CIRCUMFLEX AND ACUTE
   6 CIRCUMFLEX AND DOT BELOW
   6 CIRCUMFLEX AND GRAVE
   6 CIRCUMFLEX AND HOOK ABOVE
   6 CIRCUMFLEX AND TILDE
  12 CIRCUMFLEX BELOW
   4 COMMA BELOW
   4 CROSSED-TAIL
   7 CURL
   8 DESCENDER
  19 DIAERESIS
   4 DIAERESIS AND ACUTE
   2 DIAERESIS AND CARON
   2 DIAERESIS AND GRAVE
   6 DIAERESIS AND MACRON
   2 DIAERESIS BELOW
   8 DIAGONAL STROKE
  39 DOT ABOVE
   4 DOT ABOVE AND MACRON
  38 DOT BELOW
   2 DOT BELOW AND DOT ABOVE
   4 DOT BELOW AND MACRON
   4 DOUBLE ACUTE
   2 DOUBLE BAR
  12 DOUBLE GRAVE
   1 DOUBLE MIDDLE TILDE
   1 FISHHOOK
   1 FISHHOOK AND MIDDLE TILDE
   5 FLOURISH
  16 GRAVE
   2 HIGH STROKE
  30 HOOK
  12 HOOK ABOVE
   1 HOOK AND TAIL
   1 HOOK TAIL
   4 HORN
   4 HORN AND ACUTE
   4 HORN AND DOT BELOW
   4 HORN AND GRAVE
   4 HORN AND HOOK ABOVE
   4 HORN AND TILDE
  12 INVERTED BREVE
   1 INVERTED LAZY S
   3 LEFT HOOK
  17 LINE BELOW
   1 LONG LEFT LEG
   1 LONG LEFT LEG AND LOW RIGHT RING
   1 LONG LEG
   2 LONG RIGHT LEG
   2 LONG STROKE OVERLAY
   4 LOOP
   1 LOW RIGHT RING
   1 LOW RING INSIDE
  14 MACRON
   4 MACRON AND ACUTE
   2 MACRON AND DIAERESIS
   4 MACRON AND GRAVE
   2 MIDDLE DOT
   1 MIDDLE RING
  13 MIDDLE TILDE
   1 NOTCH
  10 OBLIQUE STROKE
  10 OGONEK
   2 OGONEK AND MACRON
  17 PALATAL HOOK
   9 RETROFLEX HOOK
   1 RETROFLEX HOOK AND BELT
   1 RIGHT HALF RING
   1 RIGHT HOOK
   6 RING ABOVE
   2 RING ABOVE AND ACUTE
   2 RING BELOW
   1 SERIF
   2 SHORT RIGHT LEG
   2 SMALL LETTER J
   1 SMALL LETTER Z
   2 SQUIRREL TAIL
  36 STROKE
   2 STROKE AND ACUTE
   2 STROKE AND DIAGONAL STROKE
   4 STROKE THROUGH DESCENDER
   4 SWASH TAIL
   3 TAIL
  16 TILDE
   4 TILDE AND ACUTE
   2 TILDE AND DIAERESIS
   2 TILDE AND MACRON
   6 TILDE BELOW
   4 TOPBAR

Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL?  Is removing those sorts of things even
legitimately "unaccenting"?  I dunno, but I think it would be good to
have some consensus about what we want this file to do.  I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro
Date:
On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Here is an unaccent.rules file that maps those 702 characters from
>> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
>> ..." to their base letter, plus 14 extra cases to match the existing
>> unaccent.rules file.  If you sort and diff this and the existing file,
>> you can see that this file only adds new lines.  Also, here is the
>> script I used to build it from UnicodeData.txt.
>
> Hm.  The "extra cases" are pretty disturbing, because some of them sure
> look like bugs; which makes me wonder how closely the unaccent.rules
> file was vetted to begin with.  For those following along at home,
> here are Thomas' extra cases, annotated by me with the Unicode file's
> description of each source character:
>
>     print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
>     print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
>     print_record(0x00e6, "a") # LATIN SMALL LETTER AE
>     print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
>     print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
>     print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
>     print_record(0x0138, "k") # LATIN SMALL LETTER KRA
>     print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
>     print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
>     print_record(0x014b, "n") # LATIN SMALL LETTER ENG
>     print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
>     print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
>     print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
>     print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO
>
> I'm really dubious that we should be translating those ligatures at
> all (since the standard file is only advertised to do "unaccenting"),
> and if we do translate them, shouldn't they convert to AE, ae, etc?

Perhaps these conversions are intended only for comparisons, full text
indexing etc but not showing the converted text to a user, in which
case it doesn't matter too much if the conversions are a bit weird
(œuf and oeuf are interchangeable in French, but euf is nonsense).
But can we actually change them?  That could cause difficulty for
users with existing unaccented data stored/indexed...  But I suppose
even adding new mappings could cause problems.

> Also unclear why we're dealing with KRA and ENG but not any of the
> other marginal letters that Unicode labels as LATIN (what the heck
> is an "AFRICAN D", for instance?)
>
> Also, while my German is nearly nonexistent, I had the idea that sharp-S
> to "S" would be considered a case-folding transformation not an accent
> removal.  Comments from German speakers welcome of course.
>
> Likewise dubious about those Cyrillic entries, although I suppose
> Teodor probably had good reasons for including them.
>
> On the other side of the coin, I think Thomas' regex might have swept up a
> bit too much.  I did this to see what sort of decorations were described:
>
> $ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
>   34 ACUTE
> ...snip...
>    4 TOPBAR
>
> Do we really need to expand the rule list fivefold to get rid of things
> like FISHHOOK and SQUIRREL TAIL?  Is removing those sorts of things even
> legitimately "unaccenting"?  I dunno, but I think it would be good to
> have some consensus about what we want this file to do.  I'm not sure
> that we should be basing the transformation on minor phrasing details
> in the Unicode data file.

Right, that does seem a little bit weak.  Instead of making
assumptions about the format of those names, we could make use of the
precomposed -> composed character mappings in the file.  We could look
for characters in the "letters" category where there is decomposition
information (ie combining characters for the individual accents) and
the base character is [a-zA-Z].  See attached.  This produces 411
mappings (including the 14 extras).  I didn't spend the time to figure
out which 300 odd characters were dropped but I noticed that our
Romanian characters of interest are definitely in.
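
The decomposition data referred to above is the same information Python's unicodedata exposes, so the idea can be sketched at runtime (unaccent itself uses a static rules file, so this only illustrates the mapping, not the module):

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Decompose to NFD, then drop all combining marks."""
    return ''.join(ch for ch in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(ch))

print(strip_marks('țţéö'))  # tteo
print(strip_marks('œ'))     # œ -- ligatures have no canonical decomposition,
                            # hence the hand-added extra cases
```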

(There is a separate can of worms here about whether to deal with
decomposed text...)

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From: Curd Reinert
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote on 17.06.2015 00:01:48:
 > Also, while my German is nearly nonexistent, I had the idea that sharp-S
 > to "S" would be considered a case-folding transformation not an accent
 > removal.  Comments from German speakers welcome of course.
The sharp-s 'ß' is historically a ligature of two different kinds of s,
of which the first one looks more like an f and the second one looks
either like a normal 's' or a 'z' (that's why it is called 'szlig' in
html). It is usually considered to be a lower-case only character, event
though an uppercase sharp-s has recently been defined. If you are using
an encoding that doesn't support 'ß', the rule is to substitute it with
'ss'. If you want to capitalize a word containing a 'ß', you substitute
it with 'SS'. For sorting purposes, DIN 5007 says that 'ß' should be
treated as 'ss'.

That's just the German point of view. Things can be a little bit
different in other German-speaking countries, e.g. in Switzerland, where
you may always substitute 'ß' with 'ss' (even if your encoding has an 'ß').

In short: I would think that replacing 'ß' with 's' is wrong, and
certainly not an accent removal.

Best regards,

Curd

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm really dubious that we should be translating those ligatures at
>> all (since the standard file is only advertised to do "unaccenting"),
>> and if we do translate them, shouldn't they convert to AE, ae, etc?

> Perhaps these conversions are intended only for comparisons, full text
> indexing etc but not showing the converted text to a user, in which
> case it doesn't matter too much if the conversions are a bit weird
> (œuf and oeuf are interchangeable in French, but euf is nonsense).
> But can we actually change them?  That could cause difficulty for
> users with existing unaccented data stored/indexed...  But I suppose
> even adding new mappings could cause problems.

Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched.  Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.

> Right, that does seem a little bit weak.  Instead of making
> assumptions about the format of those names, we could make use of the
> precomposed -> composed character mappings in the file.  We could look
> for characters in the "letters" category where there is decomposition
> information (ie combining characters for the individual accents) and
> the base character is [a-zA-Z].  See attached.  This produces 411
> mappings (including the 14 extras).  I didn't spend the time to figure
> out which 300 odd characters were dropped but I noticed that our
> Romanian characters of interest are definitely in.

I took a quick look at this list and it seems fairly sane as far as the
automatically-generated items go, except that I see it hits a few
LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
I'm still quite dubious that that is appropriate; at least, if we do it
I think we should be expanding out to the equivalent multi-letter form,
not simply taking one of the letters and dropping the rest.  Anybody else
have an opinion on how to handle ligatures?

The manually added special cases don't look any saner than they did
before :-(.  Anybody have an objection to removing those (except maybe
dotless i) in HEAD?

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Andres Freund
Date:
On 2015-06-18 15:30:46 -0400, Tom Lane wrote:
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched.  Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.

Hm. Why is it safe to add new mappings? If previously something has been
indexed with accents because unaccent didn't remove them and you're now
adding a new mapping to unaccent, the tsearch query will lookup the
wrong key (with accents removed)?

Greetings,

Andres Freund

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-18 15:30:46 -0400, Tom Lane wrote:
>> Yeah, if we do anything other than adding new mappings, I suspect that
>> part could not be back-patched.  Maybe adding new mappings shouldn't
>> be back-patched either, though it seems relatively safe to me.

> Hm. Why is it safe to add new mappings? If previously something has been
> indexed with accents because unaccent didn't remove them and you're now
> adding a new mapping to unaccent, the tsearch query will lookup the
> wrong key (with accents removed)?

This is the same situation as any change whatsoever to tsearch
dictionaries.  The party line on that is that it usually doesn't
matter much, and if it does you can rebuild your indexes.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Andres Freund
Date:
On 2015-06-18 16:30:46 -0400, Tom Lane wrote:
> This is the same situation as any change whatsoever to tsearch
> dictionaries.  The party line on that is that it usually doesn't
> matter much, and if it does you can rebuild your indexes.

I think that's an acceptable answer if the user changes their
dictionary, but if we do it for them it's different.

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-18 16:30:46 -0400, Tom Lane wrote:
>> This is the same situation as any change whatsoever to tsearch
>> dictionaries.  The party line on that is that it usually doesn't
>> matter much, and if it does you can rebuild your indexes.

> I think that's an acceptable answer if the user changes their
> dictionary, but if we do it for them it's different.

So you're arguing that unaccent.rules is forever frozen, no matter whether
it's obviously broken or not?

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Andres Freund
Date:
On 2015-06-18 16:36:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-18 16:30:46 -0400, Tom Lane wrote:
> >> This is the same situation as any change whatsoever to tsearch
> >> dictionaries.  The party line on that is that it usually doesn't
> >> matter much, and if it does you can rebuild your indexes.
>
> > I think that's an acceptable answer if the user changes their
> > dictionary, but if we do it for them it's different.
>
> So you're arguing that unaccent.rules is forever frozen, no matter whether
> it's obviously broken or not?

I think it's perfectly sensible to update the rules in master (even if
that has consequences for pg_upgraded databases). I'm just doubtful
about the merits of backpatching changes like this. But I'm not going to
fight hard/any further...

Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think it's perfectly sensible to update the rules in master (even if
> that has consequences for pg_upgraded databases). I'm just doubtful
> about the merits of backpatching changes like this.

Well, that's certainly a fair position.  How do others feel?

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From: Alvaro Herrera
Date:
Tom Lane wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'm really dubious that we should be translating those ligatures at
> >> all (since the standard file is only advertised to do "unaccenting"),
> >> and if we do translate them, shouldn't they convert to AE, ae, etc?
>
> > Perhaps these conversions are intended only for comparisons, full text
> > indexing etc but not showing the converted text to a user, in which
> > case it doesn't matter too much if the conversions are a bit weird
> > (œuf and oeuf are interchangeable in French, but euf is nonsense).
> > But can we actually change them?  That could cause difficulty for
> > users with existing unaccented data stored/indexed...  But I suppose
> > even adding new mappings could cause problems.
>
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched.  Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.

To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ascii); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches.  If we say "okay, but that funny
char is not an accent so let's leave it alone" then the charter doesn't
sound so useful to me.

The cases I care about are okay anyway, because all the funny chars in
spanish are already covered; and maybe German people always enter their
queries using the funny ss thing I can't even write, and then this is
not a problem for them.


Regarding back-patching unaccent.rules changes as discussed downthread,
I think it's okay to simply document that any indexes using the module
should be reindexed immediately after upgrading to that minor version.
The consequence of not doing so is not *that* serious anyway.  But then,
since I'm not actually affected in any way, I'm not strongly holding
this position either.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro
Date:
On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm really dubious that we should be translating those ligatures at
>>> all (since the standard file is only advertised to do "unaccenting"),
>>> and if we do translate them, shouldn't they convert to AE, ae, etc?
>
>> Perhaps these conversions are intended only for comparisons, full text
>> indexing etc but not showing the converted text to a user, in which
>> case it doesn't matter too much if the conversions are a bit weird
>> (œuf and oeuf are interchangeable in French, but euf is nonsense).
>> But can we actually change them?  That could cause difficulty for
>> users with existing unaccented data stored/indexed...  But I suppose
>> even adding new mappings could cause problems.
>
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched.  Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.
>
>> Right, that does seem a little bit weak.  Instead of making
>> assumptions about the format of those names, we could make use of the
>> precomposed -> composed character mappings in the file.  We could look
>> for characters in the "letters" category where there is decomposition
>> information (ie combining characters for the individual accents) and
>> the base character is [a-zA-Z].  See attached.  This produces 411
>> mappings (including the 14 extras).  I didn't spend the time to figure
>> out which 300 odd characters were dropped but I noticed that our
>> Romanian characters of interest are definitely in.
>
> I took a quick look at this list and it seems fairly sane as far as the
> automatically-generated items go, except that I see it hits a few
> LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
> I'm still quite dubious that that is appropriate; at least, if we do it
> I think we should be expanding out to the equivalent multi-letter form,
> not simply taking one of the letters and dropping the rest.  Anybody else
> have an opinion on how to handle ligatures?

Here is a version that optionally expands ligatures if asked to with
--expand-ligatures.  It uses the Unicode 'general category' data to
identify and strip diacritical marks and distinguish them from
ligatures which are expanded to all their parts.  It meant I had to
load a bunch of stuff into memory up front, but this approach can
handle an awkward bunch of ligatures whose component characters have
marks: DŽ, Dž, dž -> DZ, Dz, dz.  (These are considered to be single
characters to maintain a one-to-one mapping with certain Cyrillic
characters in some Balkan countries which use or used both scripts.)
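For readers following along, the general technique Thomas describes can be sketched with Python's standard unicodedata module (an illustration of the approach, not the submitted script): compatibility decomposition (NFKD) expands ligatures into their component letters and splits accented letters into base plus combining marks, which can then be filtered out by general category.

```python
import unicodedata

def unaccent_sketch(s):
    # NFKD: compatibility decomposition expands ligatures (e.g. dz-with-caron
    # becomes d + z-with-caron) and splits accented letters into
    # base character + combining mark(s).
    decomposed = unicodedata.normalize('NFKD', s)
    # Drop nonspacing marks (general category Mn), keeping the base letters.
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn')

print(unaccent_sketch('ț'))   # t
print(unaccent_sketch('ǆ'))   # dz
print(unaccent_sketch('ﬄ'))   # ffl
```

Note that œ and æ pass through unchanged here, since they carry no decomposition data in UnicodeData.txt, which is exactly why they need special-case handling.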

As for whether we *should* expand ligatures, I'm pretty sure that's
what I'd always want, but my only direct experience of languages with
ligatures as part of the orthography (rather than ligatures as a
typesetting artefact like ffl et al) is French, where œ is used in the
official spelling of a bunch of words like œil, sœur, cœur, œuvre when
they appear in books, but substituting oe is acceptable on computers
because neither the standard French keyboard nor the historically
important Latin1 character set includes the character.  I'm fairly
sure the Dutch have a similar situation with IJ, it's completely
interchangeable with the sequence IJ.

So +1 from me for ligature expansion.  It might be tempting to think
that a function called 'unaccent' should only remove diacritical
marks, but if we are going to be pedantic about it, not all
diacritical marks are actually accents anyway...

> The manually added special cases don't look any saner than they did
> before :-(.  Anybody have an objection to removing those (except maybe
> dotless i) in HEAD?

+1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:

1.  For some reason œ, æ (and uppercase equivalents) don't have
combining character data in the Unicode file, so they still need to be
treated as special cases if we're going to include ligatures.  Their
expansion should of course be oe and ae rather than what we have.
2.  Likewise ß still needs special treatment (it may be historically
composed of sz but Unicode doesn't know that, it's its own character
now and expands to ss anyway).
3.  I don't see any reason to drop the Afrikaans ʼn, though it should
surely be expanded to 'n rather than n.
4.  I have no clue about whether the single Cyrillic item in there
belongs there.
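The special cases in points 1-3 amount to a small hardcoded table applied before the decomposition-based stripping. A hypothetical sketch (this table is for illustration and is not the actual unaccent.rules content):

```python
import unicodedata

# Characters with no decomposition data in UnicodeData.txt, so they
# need hardcoded expansions (hypothetical table for illustration).
SPECIAL = {'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe',
           'ß': 'ss', 'ʼn': "'n"}

def expand(s):
    # Apply hardcoded expansions first, then strip remaining marks.
    s = ''.join(SPECIAL.get(c, c) for c in s)
    nfkd = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in nfkd if unicodedata.category(c) != 'Mn')

print(expand('œuf'))    # oeuf
print(expand('Straße')) # Strasse
```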

Just by the way, there are conventional rules for diacritic removal in
some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
Scandinavian languages and è -> e' in Italian.  A German friend of
mine has a ü in his last name and he finishes up with any of three
possible spellings of his name on various official documents, credit
cards etc as a result!  But these sorts of things are specific to
individual languages and don't belong in a general accent removal rule
file (it would be inappropriate to convert French aigüe to aiguee or
Spanish pingüino to pingueino).  I guess speakers of those languages
could consider submitting rules files for language-specific
conventions like that.
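A language-specific rules file of the kind suggested here could be as simple as a per-character replacement table. A hypothetical sketch for the German convention mentioned above (not part of unaccent):

```python
# Hypothetical German transliteration table, following the
# conventional ä, ö, ü -> ae, oe, ue rules.
GERMAN = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
          'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue', 'ß': 'ss'}

def translit_de(s):
    return ''.join(GERMAN.get(c, c) for c in s)

print(translit_de('Müller'))  # Mueller
```

Applying this table to other languages shows why it must stay language-specific: translit_de('pingüino') would yield 'pingueino', which is wrong for Spanish.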

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Thomas Munro
Date:
On Fri, Jun 19, 2015 at 2:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures. ...

I guess I should have attached the generated output too for more
convenient review, so here they are.

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Emre Hasegeli
Date:
> To me, conceptually what unaccent does is turn whatever junk you have
> into a very basic common alphabet (ascii); then it's very easy to do
> full text searches without having to worry about what accents the people
> did or did not use in their searches.  If we say "okay, but that funny
> char is not an accent so let's leave it alone" then the charter doesn't
> sound so useful to me.

It is the same for me.  It is unfortunate that this module is named
"unaccent".  There are many characters in the rule file that have
nothing to do with accents.  They are normal letters in some alphabets
that are not in ASCII.  "replace-with-ascii" would be a better name
for it.

> The cases I care about are okay anyway, because all the funny chars in
> spanish are already covered; and maybe German people always enter their
> queries using the funny ss thing I can't even write, and then this is
> not a problem for them.

I have been learning German for only a few months, and even I can
confirm that replacing "ß" with "s", or "ü" with "u", is wrong.  On
the other hand, if they were correctly replaced with "ss" and "ue", I
would be really unhappy, because it is just too common in Turkish to
press "u" instead of "ü".

I think it is better for this module to replace those characters with
a single ASCII character that sounds similar.  From this point of
view, I think it is fine to replace "ß" with "s" even if it is
obviously wrong.  This module will never be useful for German without
breaking other usages anyway.  We can try to cover as many characters
as possible, keeping this in mind.

It would also be nice to support other rules for a real "unaccent",
and a correct replacement for German.  Maybe we can add different
rule files to this module.

> Regarding back-patching unaccent.rules changes as discussed downthread,
> I think it's okay to simply document that any indexes using the module
> should be reindexed immediately after upgrading to that minor version.
> The consequence of not doing so is not *that* serious anyway.  But then,
> since I'm not actually affected in any way, I'm not strongly holding
> this position either.

I think it would cause more trouble than help if we ever back-patch
changes to these rules.

Re: BUG #13440: unaccent does not remove all diacritics

From
Thomas Munro
Date:
On Fri, Jun 19, 2015 at 2:00 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I'm really dubious that we should be translating those ligatures at
>>>> all (since the standard file is only advertised to do "unaccenting"),
>>>> and if we do translate them, shouldn't they convert to AE, ae, etc?
>>
>>> Perhaps these conversions are intended only for comparisons, full text
>>> indexing etc but not showing the converted text to a user, in which
>>> case it doesn't matter too much if the conversions are a bit weird
>>> (œuf and oeuf are interchangeable in French, but euf is nonsense).
>>> But can we actually change them?  That could cause difficulty for
>>> users with existing unaccented data stored/indexed...  But I suppose
>>> even adding new mappings could cause problems.
>>
>> Yeah, if we do anything other than adding new mappings, I suspect that
>> part could not be back-patched.  Maybe adding new mappings shouldn't
>> be back-patched either, though it seems relatively safe to me.
>>
>>> Right, that does seem a little bit weak.  Instead of making
>>> assumptions about the format of those names, we could make use of the
>>> precomposed -> composed character mappings in the file.  We could look
>>> for characters in the "letters" category where there is decomposition
>>> information (ie combining characters for the individual accents) and
>>> the base character is [a-zA-Z].  See attached.  This produces 411
>>> mappings (including the 14 extras).  I didn't spend the time to figure
>>> out which 300 odd characters were dropped but I noticed that our
>>> Romanian characters of interest are definitely in.
>>
>> I took a quick look at this list and it seems fairly sane as far as the
>> automatically-generated items go, except that I see it hits a few
>> LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
>> I'm still quite dubious that that is appropriate; at least, if we do it
>> I think we should be expanding out to the equivalent multi-letter form,
>> not simply taking one of the letters and dropping the rest.  Anybody else
>> have an opinion on how to handle ligatures?
>
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures.

I looked at this again and noticed a few problems.  I've attached a
new version.  Here is a summary of the changes compared to what is in
master:

* 6 existing ligatures expanded fully: Æ, æ, IJ, ij,  Œ, œ
* 18 new ligatures added: DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj, nj, DZ, Dz, dz, ff, fi, fl, ffi, ffl, st
* ß expanded to ss instead of S
* ʼn expanded to 'n instead of n
* 5 existing characters that involve neither diacritic marks[1] nor
ligatures dropped: ĸ, Ŀ, ŀ, Ŋ, ŋ
* 213 new characters with diacritics added: Ơ, ơ, Ư, ư, Ǎ, ǎ, Ǐ, ǐ, Ǒ,
ǒ, Ǔ, ǔ, Ǧ, ǧ, Ǩ, ǩ, Ǫ, ǫ, ǰ, Ǵ, ǵ, Ǹ, ǹ, Ȁ, ȁ, Ȃ, ȃ, Ȅ, ȅ, Ȇ, ȇ, Ȉ,
ȉ, Ȋ, ȋ, Ȍ, ȍ, Ȏ, ȏ, Ȑ, ȑ, Ȓ, ȓ, Ȕ, ȕ, Ȗ, ȗ, Ș, ș, Ț, ț, Ȟ, ȟ, Ȧ, ȧ,
Ȩ, ȩ, Ȯ, ȯ, Ȳ, ȳ, Ḁ, ḁ, Ḃ, ḃ, Ḅ, ḅ, Ḇ, ḇ, Ḋ, ḋ, Ḍ, ḍ, Ḏ, ḏ, Ḑ, ḑ, Ḓ,
ḓ, Ḙ, ḙ, Ḛ, ḛ, Ḟ, ḟ, Ḡ, ḡ, Ḣ, ḣ, Ḥ, ḥ, Ḧ, ḧ, Ḩ, ḩ, Ḫ, ḫ, Ḭ, ḭ, Ḱ, ḱ,
Ḳ, ḳ, Ḵ, ḵ, Ḷ, ḷ, Ḻ, ḻ, Ḽ, ḽ, Ḿ, ḿ, Ṁ, ṁ, Ṃ, ṃ, Ṅ, ṅ, Ṇ, ṇ, Ṉ, ṉ, Ṋ,
ṋ, Ṕ, ṕ, Ṗ, ṗ, Ṙ, ṙ, Ṛ, ṛ, Ṟ, ṟ, Ṡ, ṡ, Ṣ, ṣ, Ṫ, ṫ, Ṭ, ṭ, Ṯ, ṯ, Ṱ, ṱ,
Ṳ, ṳ, Ṵ, ṵ, Ṷ, ṷ, Ṽ, ṽ, Ṿ, ṿ, Ẁ, ẁ, Ẃ, ẃ, Ẅ, ẅ, Ẇ, ẇ, Ẉ, ẉ, Ẋ, ẋ, Ẍ,
ẍ, Ẏ, ẏ, Ẑ, ẑ, Ẓ, ẓ, Ẕ, ẕ, ẖ, ẗ, ẘ, ẙ, Ạ, ạ, Ả, ả, Ẹ, ẹ, Ẻ, ẻ, Ẽ, ẽ,
Ỉ, ỉ, Ị, ị, Ọ, ọ, Ỏ, ỏ, Ụ, ụ, Ủ, ủ, Ỳ, ỳ, Ỵ, ỵ, Ỷ, ỷ, Ỹ, ỹ

In the previous version I'd missed the LATIN ... WITH STROKE
characters like ø and ł because they aren't treated as diacritics or
ligatures in the Unicode decomposition data (they're just separate
letters, but they have an obvious unadorned ASCII replacement letter
and we already handle these).  There may be a case for replacing ø
with oe[2] but that's not what we do now.  Can any Danish or Norwegian
speakers comment on this?  There are actually 36 characters with names
matching /LATIN (CAPITAL|SMALL) LETTER [A-Z] WITH STROKE/, but I added
only the ones that we already had, namely O, D, H, L and lower case
equivalents.  Many of the rest seem to be obscure specialised
characters not used in real languages.

I don't see why we would take out that Cyrillic character: it seems
like a totally legitimate case[3].  Even though it doesn't fit in with
the idea that some might have of unaccent as the
"make-this-into-plain-ASCII" function, there doesn't seem to be any
reason why we shouldn't be able to handle Latin, Cyrillic and (if
someone with the knowledge wants to add them) Greek characters in the
same rule file -- they are non-overlapping, and all have diacritic
marks which can be stripped to give a basic character set.  That seems
pretty useful for text search type applications, which is what this
feature is for AFAIK.

[1] That L is combining with punctuation, not a mark, according to
Unicode, and generally doesn't seem to be used in any language (unlike
ʼn/'n which is a common word in Afrikaans)
[2] https://en.wikipedia.org/wiki/%C3%98 'In other languages that do
not have the letter as part of the regular alphabet, or in limited
character sets such as ASCII, ø is frequently replaced with the
two-letter combination "oe".'
[3] https://en.wiktionary.org/wiki/%D1%91 'This letter invariably
bears the word stress. However, the diaeresis is usually not used
outside of dictionaries and children’s books, where the letter is
usually written simply as е.'

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Peter Eisentraut
Date:
On 6/18/15 4:48 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I think it's perfectly sensible to update the rules in master (even if
>> that has consequences for pg_upgraded databases). I'm just doubtful
>> about the merits of backpatching changes like this.
>
> Well, that's certainly a fair position.  How do others feel?

I wouldn't change it in the backbranches either.  It's easy to change
the mapping locally if there is a need.

Re: BUG #13440: unaccent does not remove all diacritics

From
Peter Eisentraut
Date:
On 6/18/15 5:17 PM, Alvaro Herrera wrote:
> To me, conceptually what unaccent does is turn whatever junk you have
> into a very basic common alphabet (ascii); then it's very easy to do
> full text searches without having to worry about what accents the people
> did or did not use in their searches.  If we say "okay, but that funny
> char is not an accent so let's leave it alone" then the charter doesn't
> sound so useful to me.

I think unaccent is one of those contrib things that are useful but not
really fully thought out and therefore won't ever become an official
core feature.  It is what it is, and we can tweak it slightly, but
thinking too hard about what it "should" do won't lead anywhere.

If we wanted to do this "properly", we could do something like: perform
Unicode canonical decomposition, then strip out all combining
characters.  I don't know how useful that is in practice, though.  And
it won't "solve" issues such as German ß, which probably doesn't have a
one-size-fits-all solution.
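The "proper" approach described above maps directly onto Python's unicodedata as a rough sketch: canonical (NFD) decomposition followed by stripping all combining characters. As noted, it leaves ß untouched, because ß has no canonical decomposition.

```python
import unicodedata

def strip_combining(s):
    # Canonical decomposition, then drop all combining characters.
    nfd = unicodedata.normalize('NFD', s)
    return ''.join(c for c in nfd if not unicodedata.combining(c))

print(strip_combining('ț'))  # t
print(strip_combining('ß'))  # ß (unchanged: no canonical decomposition)
```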

Re: BUG #13440: unaccent does not remove all diacritics

From
Andres Freund
Date:
Hi,

On 2015-06-23 13:00:43 +1200, Thomas Munro wrote:
> I looked at this again and noticed a few problems.  I've attached a
> new version.

My re-reading of the discussion is to commit this to master? I'll do
that unless somebody protests.

To me it seems like a good idea to include the generation script
somewhere? Unicode isn't all that static.

Regards,

Andres Freund

Re: BUG #13440: unaccent does not remove all diacritics

From
Thomas Munro
Date:
On Thu, Sep 3, 2015 at 4:55 AM, Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2015-06-23 13:00:43 +1200, Thomas Munro wrote:
> > I looked at this again and noticed a few problems.  I've attached a
> > new version.
>
> My re-reading of the discussion is to commit this to master? I'll do
> that unless somebody protests.
>

What about 9.5?  I thought the idea was not to change preexisting rules in
back branches because of compatibility problems with existing data that
users may have indexed/stored, but 9.5 obviously can't have that problem
yet.


> To me it seems like a good idea to include the generation script
> somewhere? Unicode isn't all that static.
>

+1

--
Thomas Munro
http://www.enterprisedb.com

Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Sep 3, 2015 at 4:55 AM, Andres Freund <andres@anarazel.de> wrote:
>> My re-reading of the discussion is to commit this to master? I'll do
>> that unless somebody protests.

> What about 9.5?

It's too late for 9.5.  We are trying to stabilize that branch not
destabilize it.

FWIW, I wasn't sure we had consensus to commit this at all; if I had
thought that I would've done so back in June.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From
Andres Freund
Date:
On 2015-09-02 19:23:11 -0400, Tom Lane wrote:
> It's too late for 9.5.  We are trying to stabilize that branch not
> destabilize it.

Agreed. Doesn't seem to be pressing enough.

> FWIW, I wasn't sure we had consensus to commit this at all; if I had
> thought that I would've done so back in June.

There seemed to be nobody really arguing against fixing it in master and
several people on board with it?

Greetings,

Andres Freund

Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-09-02 19:23:11 -0400, Tom Lane wrote:
>> FWIW, I wasn't sure we had consensus to commit this at all; if I had
>> thought that I would've done so back in June.

> There seemed to be nobody really arguing against fixing it in master and
> several people on board with it?

I think it would be good to have a discussion someplace more widely read
than the -bugs list.  -hackers or -general would be the place to find
out if this sort of change is widely acceptable.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From
Andres Freund
Date:
On 2015-09-02 19:33:00 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-09-02 19:23:11 -0400, Tom Lane wrote:
> >> FWIW, I wasn't sure we had consensus to commit this at all; if I had
> >> thought that I would've done so back in June.
>
> > There seemed to be nobody really arguing against fixing it in master and
> > several people on board with it?
>
> I think it would be good to have a discussion someplace more widely read
> than the -bugs list.  -hackers or -general would be the place to find
> out if this sort of change is widely acceptable.

I'm fine with doing that - but my impression from this thread was that
the only thing we're unsure about is whether to backpatch this change?
Weren't you pretty close to backpatching it?

Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-09-02 19:33:00 -0400, Tom Lane wrote:
>> I think it would be good to have a discussion someplace more widely read
>> than the -bugs list.  -hackers or -general would be the place to find
>> out if this sort of change is widely acceptable.

> I'm fine with doing that - but my impression from this thread was that
> the only thing we're unsure about is whether to backpatch this change?
> Weren't you pretty close to backpatching it?

No, not after someone pointed out that it could have strange side-effects
on full text search configurations that used unaccent.  You'd stop being
able to find documents whenever your search term is stripped of accents
more thoroughly than before.  That might be all right in a new major
release (if it documents that you might have to rebuild your FTS indexes
and derived tsvector columns).  It's not all right in a minor release.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From
Andres Freund
Date:
On 2015-09-02 19:59:37 -0400, Tom Lane wrote:
> No, not after someone pointed out that it could have strange side-effects
> on full text search configurations that used unaccent.  You'd stop being
> able to find documents whenever your search term is stripped of accents
> more thoroughly than before.  That might be all right in a new major
> release (if it documents that you might have to rebuild your FTS indexes
> and derived tsvector columns).  It's not all right in a minor release.

Yes, it was me that pointed that out and argued against it ;)
http://archives.postgresql.org/message-id/20150618202135.GB29350%40alap3.anarazel.de
and following.

Thomas, will you repost context & a patch implementing this (instead of
just files)?

Greetings,

Andres Freund

Re: BUG #13440: unaccent does not remove all diacritics

From
Alvaro Herrera
Date:
Tom Lane wrote:

> No, not after someone pointed out that it could have strange side-effects
> on full text search configurations that used unaccent.  You'd stop being
> able to find documents whenever your search term is stripped of accents
> more thoroughly than before.  That might be all right in a new major
> release (if it documents that you might have to rebuild your FTS indexes
> and derived tsvector columns).  It's not all right in a minor release.

Hmm, so what happens if you pg_upgrade FTS indexes?  Are they somehow
marked invalid and a REINDEX is forced?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13440: unaccent does not remove all diacritics

From
Thomas Munro
Date:
On Thu, Sep 3, 2015 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-02 19:59:37 -0400, Tom Lane wrote:
> No, not after someone pointed out that it could have strange side-effects
> on full text search configurations that used unaccent.  You'd stop being
> able to find documents whenever your search term is stripped of accents
> more thoroughly than before.  That might be all right in a new major
> release (if it documents that you might have to rebuild your FTS indexes
> and derived tsvector columns).  It's not all right in a minor release.

Yes, it was me that pointed that out and argued against it ;)
http://archives.postgresql.org/message-id/20150618202135.GB29350%40alap3.anarazel.de
and following.

Thomas, will you repost context & a patch implementing this (instead of
just files)?

Attached.  I gave the script a better name and some comments.

--
Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> No, not after someone pointed out that it could have strange side-effects
>> on full text search configurations that used unaccent.  You'd stop being
>> able to find documents whenever your search term is stripped of accents
>> more thoroughly than before.  That might be all right in a new major
>> release (if it documents that you might have to rebuild your FTS indexes
>> and derived tsvector columns).  It's not all right in a minor release.

> Hmm, so what happens if you pg_upgrade FTS indexes?  Are they somehow
> marked invalid and a REINDEX is forced?

No.  They're not broken in a fundamental way, it's just that certain
search terms no longer find document words you might think they should
match.  Oleg and Teodor argued back at the beginning of the FTS stuff
that this sort of thing wasn't critical, and I agree --- but we shouldn't
change the mapping in minor releases.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From
Bruce Momjian
Date:
On Wed, Sep  2, 2015 at 11:43:52PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> No, not after someone pointed out that it could have strange side-effects
> >> on full text search configurations that used unaccent.  You'd stop being
> >> able to find documents whenever your search term is stripped of accents
> >> more thoroughly than before.  That might be all right in a new major
> >> release (if it documents that you might have to rebuild your FTS indexes
> >> and derived tsvector columns).  It's not all right in a minor release.
>
> > Hmm, so what happens if you pg_upgrade FTS indexes?  Are they somehow
> > marked invalid and a REINDEX is forced?
>
> No.  They're not broken in a fundamental way, it's just that certain
> search terms no longer find document words you might think they should
> match.  Oleg and Teodor argued back at the beginning of the FTS stuff
> that this sort of thing wasn't critical, and I agree --- but we shouldn't
> change the mapping in minor releases.

Uh, I thought the whole discussion was whether this should be changed in
head _and_ 9.5, or just head.  I didn't think anyone was suggesting
minor releases.  We don't consider a 9.5 change to be a minor release
change at this point, do we?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
> No.  They're not broken in a fundamental way, it's just that certain
> search terms no longer find document words you might think they should
> match.  Oleg and Teodor argued back at the beginning of the FTS stuff
> that this sort of thing wasn't critical, and I agree --- but we shouldn't
> change the mapping in minor releases.

Agreed.  And updating FTS indexes isn't a pg_upgrade task, because unaccent
is a filtering dictionary and subsequent dictionaries might change the
recognition of a word.  So pg_upgrade, without knowledge of the FTS
configuration, cannot do the update correctly.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
> Uh, I thought the whole discussion was whether this should be changed in
> head _and_ 9.5, or just head.  I didn't think anyone was suggesting
> minor releases.  We don't consider a 9.5 change to be a minor release
> change at this point, do we?

9.6 only.  Users who are affected by this bug can change unaccent.rules manually.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
On 19/06/2015 04:00, Thomas Munro wrote:
> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I took a quick look at this list and it seems fairly sane as far as
>> the automatically-generated items go, except that I see it hits a few
>> LIGATURE cases (including the existing ij cases, but also fi fl and
>> ffl). I'm still quite dubious that that is appropriate; at least, if
>> we do it I think we should be expanding out to the equivalent
>> multi-letter form, not simply taking one of the letters and dropping
>> the rest. Anybody else have an opinion on how to handle ligatures?
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures.  It uses the Unicode 'general category' data to
> identify and strip diacritical marks and distinguish them from
> ligatures which are expanded to all their parts.  It meant I had to
> load a bunch of stuff into memory up front, but this approach can
> handle an awkward bunch of ligatures whose component characters have
> marks: DŽ, Dž, dž -> DZ, Dz, dz.  (These are considered to be single
> characters to maintain a one-to-one mapping with certain Cyrillic
> characters in some Balkan countries which use or used both scripts.)
>
> As for whether we *should* expand ligatures, I'm pretty sure that's
> what I'd always want, but my only direct experience of languages with
> ligatures as part of the orthography (rather than ligatures as a
> typesetting artefact like ffl et al) is French, where œ is used in the
> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
> they appear in books, but substituting oe is acceptable on computers
> because neither the standard French keyboard nor the historically
> important Latin1 character set includes the character.  I'm fairly
> sure the Dutch have a similar situation with IJ, it's completely
> interchangeable with the sequence IJ.
>
> So +1 from me for ligature expansion.  It might be tempting to think
> that a function called 'unaccent' should only remove diacritical
> marks, but if we are going to be pedantic about it, not all
> diacritical marks are actually accents anyway...
>
>> The manually added special cases don't look any saner than they did
>> before :-(.  Anybody have an objection to removing those (except maybe
>> dotless i) in HEAD?
> +1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:
>
> 1.  For some reason œ, æ (and uppercase equivalents) don't have
> combining character data in the Unicode file, so they still need to be
> treated as special cases if we're going to include ligatures.  Their
> expansion should of course be oe and ae rather than what we have.
> 2.  Likewise ß still needs special treatment (it may be historically
> composed of sz but Unicode doesn't know that, it's its own character
> now and expands to ss anyway).
> 3.  I don't see any reason to drop the Afrikaans ʼn, though it should
> surely be expanded to 'n rather than n.
> 4.  I have no clue about whether the single Cyrillic item in there
> belongs there.
>
> Just by the way, there are conventional rules for diacritic removal in
> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
> Scandinavian languages and è -> e' in Italian.  A German friend of
> mine has a ü in his last name and he finishes up with any of three
> possible spellings of his name on various official documents, credit
> cards etc as a result!  But these sorts of things are specific to
> individual languages and don't belong in a general accent removal rule
> file (it would be inappropriate to convert French aigüe to aiguee or
> Spanish pingüino to pingueino).  I guess speakers of those languages
> could consider submitting rules files for language-specific
> conventions like that.
>
I use "unaccent" and I am very pleased with the applied patches for the
default rules and the Python script to generate them.

But as you pointed out, the "extra cases" (the subset of characters
that is not generated by the script but hardcoded) are pretty
disturbing.  The main problem to me is that a number of "extra cases"
are missing: the script handles only an arbitrary few ligatures and
leaves many things aside.  So I looked for a way to improve the
generation to avoid this trouble.

As you said, some characters don't have a Unicode decomposition.  So,
to handle all these cases, we can use the standard Unicode
transliterator Latin-ASCII (available in CLDR), which associates
Unicode characters with ASCII-range equivalents.  This approach seems
much more elegant: it avoids hardcoded cases, and the transliterations
are semantically correct (at least, as much as they can be).
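As a rough illustration (not the actual patch script): the CLDR transform file stores its rules as text lines like "œ → oe ;" inside tRule elements. A minimal sketch of extracting simple one-character rules might look like the following, deliberately ignoring the contexts, variables, and filters that real transform rules can use; the sample XML below is synthetic, in the general shape of the CLDR data.

```python
import re
import xml.etree.ElementTree as ET

def parse_simple_rules(xml_text):
    # Collect single-character 'source → target ;' rules from tRule
    # elements.  Multi-character sources, contexts, and variables are
    # ignored in this sketch.
    mapping = {}
    root = ET.fromstring(xml_text)
    for trule in root.iter('tRule'):
        for line in (trule.text or '').splitlines():
            line = line.split('#', 1)[0].strip().rstrip(';').strip()
            m = re.match(r'^(\S+)\s*→\s*(\S+)$', line)
            if m and len(m.group(1)) == 1:
                mapping[m.group(1)] = m.group(2)
    return mapping

# Synthetic sample in the general shape of the Latin-ASCII rules:
sample = """<transforms><transform source="Latin" target="ASCII">
<tRule>
œ → oe ;
ß → ss ;
</tRule></transform></transforms>"""

print(parse_simple_rules(sample))  # {'œ': 'oe', 'ß': 'ss'}
```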

So I modified the script: command-line arguments are used to pass the
file path of the transliterator (available as an XML file in the
Unicode Common Locale Data Repository).  You will find attached the
new script and the generated output for convenience; I will also
propose a patch for the Commitfest.  Note that the script now takes
(at most) two input files: UnicodeData.txt and (optionally) the XML
file of the transliterator.

By the way, I took the opportunity to make the script more
user-friendly through several surface changes.  There is now very
light support for command-line arguments, with help messages.  The
text file was previously passed to the script on standard input; that
approach is not appropriate when two files must be used, so, as I
mentioned, command-line arguments are now used to pass the paths.

Finally, using this transliterator inevitably increases the number of
characters handled (there are now 1044 of them).  I do not think that
is a problem, on the contrary; after several tests on index
generation, I see no significant performance difference.  Nonetheless,
the transliterator remains optional, and a command-line option is
available to disable it (so one can easily generate a small rules
file, if desired).  It seemed logical to me, however, to keep it on by
default: that is, a priori, the desired behavior.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
On 24/01/2016 04:18, Léonard Benedetti wrote:
> Le 19/06/2015 04:00, Thomas Munro a écrit :
>> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I took a quick look at this list and it seems fairly sane as far as
>>> the automatically-generated items go, except that I see it hits a few
>>> LIGATURE cases (including the existing ij cases, but also fi fl and
>>> ffl). I'm still quite dubious that that is appropriate; at least, if
>>> we do it I think we should be expanding out to the equivalent
>>> multi-letter form, not simply taking one of the letters and dropping
>>> the rest. Anybody else have an opinion on how to handle ligatures?
>> Here is a version that optionally expands ligatures if asked to with
>> --expand-ligatures.  It uses the Unicode 'general category' data to
>> identify and strip diacritical marks and distinguish them from
>> ligatures which are expanded to all their parts.  It meant I had to
>> load a bunch of stuff into memory up front, but this approach can
>> handle an awkward bunch of ligatures whose component characters have
>> marks: DŽ, Dž, dž -> DZ, Dz, dz.  (These are considered to be single
>> characters to maintain a one-to-one mapping with certain Cyrillic
>> characters in some Balkan countries which use or used both scripts.)
>>
>> As for whether we *should* expand ligatures, I'm pretty sure that's
>> what I'd always want, but my only direct experience of languages with
>> ligatures as part of the orthography (rather than ligatures as a
>> typesetting artefact like ffl et al) is French, where œ is used in the
>> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
>> they appear in books, but substituting oe is acceptable on computers
>> because neither the standard French keyboard nor the historically
>> important Latin1 character set includes the character.  I'm fairly
>> sure the Dutch have a similar situation with IJ, it's completely
>> interchangeable with the sequence IJ.
>>
>> So +1 from me for ligature expansion.  It might be tempting to think
>> that a function called 'unaccent' should only remove diacritical
>> marks, but if we are going to be pedantic about it, not all
>> diacritical marks are actually accents anyway...
>>
>>> The manually added special cases don't look any saner than they did
>>> before :-(.  Anybody have an objection to removing those (except maybe
>>> dotless i) in HEAD?
>> +1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:
>>
>> 1.  For some reason œ, æ (and uppercase equivalents) don't have
>> combining character data in the Unicode file, so they still need to be
>> treated as special cases if we're going to include ligatures.  Their
>> expansion should of course be oe and ae rather that what we have.
>> 2.  Likewise ß still needs special treatment (it may be historically
>> composed of sz but Unicode doesn't know that, it's its own character
>> now and expands to ss anyway).
>> 3.  I don't see any reason to drop the Afrikaans ʼn, though it should
>> surely be expanded to 'n rather than n.
>> 4.  I have no clue about whether the single Cyrillic item in there
>> belongs there.
>>
>> Just by the way, there are conventional rules for diacritic removal in
>> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
>> Scandinavian languages and è -> e' in Italian.  A German friend of
>> mine has a ü in his last name and he finishes up with any of three
>> possible spellings of his name on various official documents, credit
>> cards etc as a result!  But these sorts of things are specific to
>> individual languages and don't belong in a general accent removal rule
>> file (it would be inappropriate to convert French aigüe to aiguee or
>> Spanish pingüino to pingueino).  I guess speakers of those languages
>> could consider submitting rules files for language-specific
>> conventions like that.
>>
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script that generates them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> that is hardcoded rather than generated by the script) are pretty
> disturbing. The main problem to me is that a number of "extra cases"
> are missing: the script handles an arbitrarily small set of ligatures
> but leaves many things aside. So I looked for a way to improve the
> generation and avoid this problem.
>
> As you said, some characters don't have a Unicode decomposition. To
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which associates Unicode characters
> with ASCII-range equivalents. This approach seems much more elegant: it
> avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).
>
> So, I modified the script: command line arguments are now used to pass
> the file path of the transliterator (available as an XML file in the
> Unicode Common Locale Data Repository). You will find attached the new
> script and the generated output for convenience; I will also propose a
> patch for the Commitfest. Note that the script now takes (at most) two
> input files: UnicodeData.txt and (optionally) the XML file of the
> transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> through several surface changes. There is now very light support for
> command line arguments, with help messages. The text file was previously
> passed to the script on standard input; that approach is not appropriate
> when two files must be used, so, as mentioned, command line arguments
> are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number of
> characters handled (1044 in total). I do not think that is a problem; on
> the contrary, after several tests on index generation I see no
> significant performance difference. Nonetheless, the transliterator
> remains optional, and a command line option is available to disable it
> (so one can easily generate a small rules file, if desired). It seemed
> logical to me to keep it on by default: that is, a priori, the desired
> behavior.
>
> Léonard Benedetti
Here is the patch, attached.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Thomas Munro
Date:
On Sun, Jan 24, 2016 at 4:18 PM, Léonard Benedetti <benedetti@mlpo.fr> wrote:
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script that generates them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> that is hardcoded rather than generated by the script) are pretty
> disturbing. The main problem to me is that a number of "extra cases"
> are missing: the script handles an arbitrarily small set of ligatures
> but leaves many things aside. So I looked for a way to improve the
> generation and avoid this problem.
>
> As you said, some characters don't have a Unicode decomposition. To
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which associates Unicode characters
> with ASCII-range equivalents. This approach seems much more elegant: it
> avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).

Wow.  It would indeed be nice to use this dataset rather than
maintaining the special cases for œ et al.  It would also be nice to
pick up all those other things like ©, ½, …, ≪, ≫ (though these stray a
little bit further from the functionality implied by unaccent's name).
I don't think this alone will completely get rid of the hardcoded
special cases though, because we have these two mappings which look
like Latin but are in fact Cyrillic and I assume we need to keep them:

Ё Е
ё е

Should we extend the composition data analysis to make these remaining
special cases go away?  We'd need a definition of is_plain_letter that
returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
Depending on how you do that, you could sweep in some more Cyrillic
mappings and a ton of stuff from other scripts that have precomposed
diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
someone with knowledge of relevant languages to sign off on the result
-- so it might make sense to stick to a definition that includes just
Latin and Cyrillic for now.

(Otherwise it might be tempting to use *only* the transliterator
approach, but CLDR doesn't seem to have appropriate transliterator
files for other scripts.  They have for example Cyrillic -> Latin, but
we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
ASCII.)
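The decomposition described here can be checked with Python's unicodedata module; the snippet below only illustrates the 0415 + 0308 relationship and is not code from the patch:

```python
import unicodedata

# U+0401 (Ё) decomposes canonically into U+0415 (Е) plus U+0308
# (combining diaeresis), so an is_plain_letter that accepts 0415 would
# let the generator derive the Ё -> Е mapping automatically.
print(unicodedata.decomposition('\u0401'))  # 0415 0308

def strip_combining_marks(ch):
    """NFD-decompose ch and drop combining marks (category Mn)."""
    return ''.join(c for c in unicodedata.normalize('NFD', ch)
                   if unicodedata.category(c) != 'Mn')

print(strip_combining_marks('\u0401'))  # Е (U+0415)
```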

> So, I modified the script: command line arguments are now used to pass
> the file path of the transliterator (available as an XML file in the
> Unicode Common Locale Data Repository). You will find attached the new
> script and the generated output for convenience; I will also propose a
> patch for the Commitfest. Note that the script now takes (at most) two
> input files: UnicodeData.txt and (optionally) the XML file of the
> transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> through several surface changes. There is now very light support for
> command line arguments, with help messages. The text file was previously
> passed to the script on standard input; that approach is not appropriate
> when two files must be used, so, as mentioned, command line arguments
> are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number of
> characters handled (1044 in total). I do not think that is a problem; on
> the contrary, after several tests on index generation I see no
> significant performance difference. Nonetheless, the transliterator
> remains optional, and a command line option is available to disable it
> (so one can easily generate a small rules file, if desired). It seemed
> logical to me to keep it on by default: that is, a priori, the desired
> behavior.

+1

--
Thomas Munro
http://www.enterprisedb.com

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
> I don't think this alone will completely get rid of the hardcoded
> special cases though, because we have these two mappings which look
> like Latin but are in fact Cyrillic and I assume we need to keep them:
>
Ё Е
ё е
>
As a native Russian speaker I can explain why we need to keep these two rules.
The letter 'Ё' is not 'Е' with some accent/diacritic sign; it is a separate
letter in the Russian alphabet. But a lot of newspapers, magazines and even
books use 'Е' instead of 'Ё' to simplify printing house work. Russian speakers
do not make mistakes while reading, because 'Ё' is infrequent and everybody
remembers the right pronunciation. Also, on the Russian keyboard 'Ё' is placed
in an inconvenient spot (the key with ` or ~), so many Russian writers use 'Е'
instead of it to increase typing speed.

Please, do not remove at least this special case.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
TL;DR: Special cases that were not handled by the new script have been
added. All characters handled by unaccent are now handled by the script,
as well as new ones.

26/01/2016 00:44, Thomas Munro wrote:
> Wow.  It would indeed be nice to use this dataset rather than
> maintaining the special cases for œ et al.  It would also be nice to pick
> up all those other things like ©, ½, …, ≪, ≫ (though these stray a
> little bit further from the functionality implied by unaccent's name).
It is true that the file grows in size and offers more and more
characters. But as Alvaro Herrera said in a previous mail:

“To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ascii); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches.”

and I think it makes sense. And since there is no significant
performance difference, I think we can continue in this way.
> I don't think this alone will completely get rid of the hardcoded
> special cases though, because we have these two mappings which look
> like Latin but are in fact Cyrillic and I assume we need to keep them:
>
> Ё Е
> ё е
Regarding the Cyrillic characters mentioned, I had not noticed them. But
yes, we have to keep them (see Teodor Sigaev's message below).
Furthermore, I continued my research to see which characters were not
handled yet; there are potentially many, and it is not always clear
whether they should be. In particular, I found several characters in the
“Letterlike Symbols” Unicode block (U+2100 to U+214F) that were absent
from the transliterator (℃, ℉, etc.). So I changed the script to handle
special cases, and I added those I just mentioned (you will find
attached the new version of the script and the generated output for
convenience).
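Side note: the Letterlike Symbols in question do carry compatibility decompositions in UnicodeData.txt, which can be checked quickly with Python's unicodedata module — an independent illustration, not part of the attached script:

```python
import unicodedata

# ℃ (U+2103) and ℉ (U+2109) have <compat> decompositions to "°C" and
# "°F" in UnicodeData.txt, even though the Latin-ASCII transliterator
# omits them.
for ch in '\u2103\u2109':
    print(repr(ch), '->', repr(unicodedata.normalize('NFKD', ch)))
```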
>
> Should we extend the composition data analysis to make these remaining
> special cases go away?  We'd need a definition of is_plain_letter that
> returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
> Depending on how you do that, you could sweep in some more Cyrillic
> mappings and a ton of stuff from other scripts that have precomposed
> diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
> someone with knowledge of relevant languages to sign off on the result
> -- so it might make sense to stick to a definition that includes just
> Latin and Cyrillic for now.
>
> (Otherwise it might be tempting to use *only* the transliterator
> approach, but CLDR doesn't seem to have appropriate transliterator
> files for other scripts.  They have for example Cyrillic -> Latin, but
> we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
> ASCII.)
>
Indeed, I added some special cases, but I doubt very much that this is
exhaustive. It would be good to find a way to avoid these cases.
Regarding the various solutions proposed, it may be possible to opt for
a hybrid one: for example, extend the analysis of the composition data
to blocks where relevant (some characters mentioned above show that some
are not in the transliterators), or use a transliterator when it is more
convenient (perhaps for Cyrillic, etc.).

You are also right that sometimes we must think about specific languages
(and so we would need someone with knowledge of those languages); this
is also true for some blocks, for which we must decide whether including
certain characters makes sense or not. I am thinking, notably, of the
extended Latin blocks (Latin Extended-A, B, Additional, C, D, etc.),
which are still ignored.

11/02/2016 16:36, Teodor Sigaev wrote:
>> I don't think this alone will completely get rid of the hardcoded
>> special cases though, because we have these two mappings which look
>> like Latin but are in fact Cyrillic and I assume we need to keep them:
>>
>> Ё Е
>> ё е
>>
> As a native Russian speaker I can explain why we need to keep these two
> rules.
> The letter 'Ё' is not 'Е' with some accent/diacritic sign; it is a
> separate letter in the Russian alphabet. But a lot of newspapers,
> magazines and even books use 'Е' instead of 'Ё' to simplify printing
> house work. Russian speakers do not make mistakes while reading, because
> 'Ё' is infrequent and everybody remembers the right pronunciation. Also,
> on the Russian keyboard 'Ё' is placed in an inconvenient spot (the key
> with ` or ~), so many Russian writers use 'Е' instead of it to increase
> typing speed.
>
> Please, do not remove at least this special case.
>
This case is now managed as a special case in the new version (see above).

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
11/02/2016 22:05, Léonard Benedetti wrote:
> So I changed the script to handle special cases, and I added those I
> just mentioned (you will find attached the new version of the script
> and the generated output for convenience).

Here is the patch, attached.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
>> So I changed the script to handle special cases, and I added those I
>> just mentioned (you will find attached the new version of the script
>> and the generated output for convenience).
>
> Here is the patch, attached.

I'm inclined to commit this patch because it offers a more regular way to
update the unaccent rules. That is nice.

But I have some notes:
1. Is it possible not to restrict the generator script to Python 2? Python 2,
it seems, will go away in the near future, and it will not be comfortable to
install it for a single task.
2. As is easy to see, nowhere in the pgsql sources is there any UTF-8
encoding, just ASCII. I don't see a reason to make an exception for this
script.

Thank you.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Bruce Momjian
Date:
On Fri, Feb 12, 2016 at 07:44:22PM +0300, Teodor Sigaev wrote:
> >>So I changed the script to handle special cases, and I added those I
> >>just mentioned (you will find attached the new version of the script
> >>and the generated output for convenience).
> >
> >Here is the patch, attached.
>
> I'm inclined to commit this patch because it offers a more regular
> way to update the unaccent rules. That is nice.
>
> But I have some notes:
> 1. Is it possible not to restrict the generator script to Python 2?
> Python 2, it seems, will go away in the near future, and it will not be
> comfortable to install it for a single task.
> 2. As is easy to see, nowhere in the pgsql sources is there any
> UTF-8 encoding, just ASCII. I don't see a reason to make an exception
> for this script.

How would this affect pg_trgm indexes upgraded by pg_upgrade?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
>> I'm inclined to commit this patch because it offers a more regular
>> way to update the unaccent rules. That is nice.
> How would this affect pg_trgm indexes upgraded by pg_upgrade?

pg_upgrade doesn't know anything about tsearch configuration, so it can't do
anything useful with changes to the FTS configuration. The right way is to
recreate all tsvector columns, but pg_upgrade obviously is not able to run the
tsearch stack: parser, dictionaries, etc.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Bruce Momjian
Date:
On Wed, Feb 17, 2016 at 03:06:26PM +0300, Teodor Sigaev wrote:
> >>I'm inclined to commit this patch because it offers a more regular
> >>way to update the unaccent rules. That is nice.
> >How would this affect pg_trgm indexes upgraded by pg_upgrade?
>
> pg_upgrade doesn't know anything about tsearch configuration, so it
> can't do anything useful with changes to the FTS configuration. The
> right way is to recreate all tsvector columns, but pg_upgrade obviously
> is not able to run the tsearch stack: parser, dictionaries, etc.

OK, so would we need to mark such indexes as invalid?  Are you saying
the tsvector columns would also be invalid?  Yikes.  We could at least
tell people to recreate them.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
12/02/2016 17:44, Teodor Sigaev wrote:
> I'm inclined to commit this patch because it offers a more regular
> way to update the unaccent rules. That is nice.
>
> But I have some notes:
> 1. Is it possible not to restrict the generator script to Python 2?
> Python 2, it seems, will go away in the near future, and it will not be
> comfortable to install it for a single task.

Yes, I agree, it makes sense; the script was originally Python 2, but
Python 2 is legacy. Moreover, adapting the script to Python 3 seems
trivial.

> 2. As is easy to see, nowhere in the pgsql sources is there any UTF-8
> encoding, just ASCII. I don't see a reason to make an exception for this
> script.

First of all, the majority of pgsql code is C, a language where the
default encoding is not the same everywhere (it may depend on the locale
settings or the compiler), so it is logical to use ASCII.

On the other hand, UTF-8 encoding for source code is *a feature of
Python 3* (to quote the documentation: “The default encoding for Python
source code is UTF-8”), so there is no possible ambiguity, and it will
not be a problem. That said, some non-ASCII characters could be removed
from the script's source code without harm (I am thinking in particular
of "“" and "”"). Nevertheless, for some comments it would be unfortunate
(e.g. “# RegEx to parse rules (e.g. “Đ → D ; […]”)” or “# ℃
°C”).

>
> Thank you.
>

Thus, I propose to adapt the code to Python 3 (the encoding of the
script does not seem to be a problem, for the reasons above). I will try
to do it shortly.

Thank you for your feedback.

Léonard Benedetti




Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
>
> On the other hand, UTF-8 encoding for source code is *a feature of
> Python 3* (to quote the documentation: “The default encoding for Python
> source code is UTF-8”), so there is no possible ambiguity, and it will
> not be a problem. That said, some non-ASCII characters could be removed
> from the script's source code without harm (I am thinking in particular
> of "“" and "”"). Nevertheless, for some comments it would be unfortunate
> (e.g. “# RegEx to parse rules (e.g. “Đ → D ; […]”)” or “# ℃
> °C”).
Ok, I didn't know that.


> Thus, I propose to adapt the code to Python 3 (the encoding of the
> script does not seem to be a problem, for the reasons above). I will try
> to do it shortly.
We are waiting...

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
10/03/2016 14:46, Teodor Sigaev wrote:
>>
>> On the other hand, UTF-8 encoding for source code is *a feature of
>> Python 3* (to quote the documentation: “The default encoding for Python
>> source code is UTF-8”), so there is no possible ambiguity, and it will
>> not be a problem. That said, some non-ASCII characters could be removed
>> from the script's source code without harm (I am thinking in particular
>> of "“" and "”"). Nevertheless, for some comments it would be unfortunate
>> (e.g. “# RegEx to parse rules (e.g. “Đ → D ; […]”)” or “# ℃
>> °C”).
> Ok, I didn't know that.
>
>
>> Thus, I propose to adapt the code to Python 3 (the encoding of the
>> script does not seem to be a problem, for the reasons above). I will try
>> to do it shortly.
> We are waiting...
>
Sorry for the delay; adapting the script to Python 3 was very easy (the
code is almost identical).

As usual, you will find attached the new version of the script and the
generated output for convenience.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
10/03/2016 15:35, Léonard Benedetti wrote:
> 10/03/2016 14:46, Teodor Sigaev wrote:
>>> On the other hand, UTF-8 encoding for source code is *a feature of
>>> Python 3* (to quote the documentation: “The default encoding for Python
>>> source code is UTF-8”), so there is no possible ambiguity, and it will
>>> not be a problem. That said, some non-ASCII characters could be removed
>>> from the script's source code without harm (I am thinking in particular
>>> of "“" and "”"). Nevertheless, for some comments it would be unfortunate
>>> (e.g. “# RegEx to parse rules (e.g. “Đ → D ; […]”)” or “# ℃
>>> °C”).
>> Ok, I didn't know that.
>>
>>
>>> Thus, I propose to adapt the code to Python 3 (the encoding of the
>>> script does not seem to be a problem, for the reasons above). I will try
>>> to do it shortly.
>> We are waiting...
>>
> Sorry for the delay; adapting the script to Python 3 was very easy (the
> code is almost identical).
>
> As usual, you will find attached the new version of the script and the
> generated output for convenience.
>
> Léonard Benedetti
Here is the patch, attached.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From
Teodor Sigaev
Date:
> Here is the patch, attached.

Hmm, now the script doesn't work with Python 2:
% python2 -V
Python 2.7.11
% python2 contrib/unaccent/generate_unaccent_rules.py
   File "contrib/unaccent/generate_unaccent_rules.py", line 7
SyntaxError: Non-ASCII character '\xe2' in file
contrib/unaccent/generate_unaccent_rules.py on line 7, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details

Is it intentional?
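For reference: the error above is the PEP 263 behavior — Python 2 rejects non-ASCII bytes in a source file unless an encoding declaration appears on the first or second line. A minimal illustration, not the patch itself:

```python
# -*- coding: utf-8 -*-
# PEP 263: this declaration tells Python 2 that the file's bytes are
# UTF-8, so non-ASCII characters (such as the arrow below) no longer
# raise "SyntaxError: Non-ASCII character". Python 3 assumes UTF-8 by
# default, so the line is harmless there.

# A non-ASCII comment that would otherwise break Python 2: Đ → D
msg = "parsed ok"
print(msg)
```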

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
11/03/2016 17:38, Teodor Sigaev wrote:
>> Here is the patch, attached.
>
> Hmm, now the script doesn't work with Python 2:
> % python2 -V
> Python 2.7.11
> % python2 contrib/unaccent/generate_unaccent_rules.py
>   File "contrib/unaccent/generate_unaccent_rules.py", line 7
> SyntaxError: Non-ASCII character '\xe2' in file
> contrib/unaccent/generate_unaccent_rules.py on line 7, but no encoding
> declared; see http://python.org/dev/peps/pep-0263/ for details
>
> Is it intentional?
>
In fact, Python 3 is not backward-compatible. This version brings many
new features that are not compatible with Python 2, so yes, it is
intentional. To quote the documentation: “Python 3.0 […] is the first
ever intentionally backwards incompatible Python release”.

Despite all that, I think this transition to Python 3 is wise; it has
been available since 2008. Python 2 is legacy, and its last version
(2.7) is an “end-of-life release”.

Léonard Benedetti



Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Léonard Benedetti <benedetti@mlpo.fr> writes:
> Despite all that, I think this transition to Python 3 is wise; it has
> been available since 2008. Python 2 is legacy, and its last version
> (2.7) is an “end-of-life release”.

Doesn't matter.  We support both Python 2 and 3, and this script must
do so as well, else it's not getting committed.  Any desupport for
Python 2 in PG is very far away; no one has even suggested we consider
it yet.
        regards, tom lane



Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
11/03/2016 19:16, Tom Lane wrote:
> Léonard Benedetti <benedetti@mlpo.fr> writes:
>> Despite all that, I think this transition to Python 3 is wise; it has
>> been available since 2008. Python 2 is legacy, and its last version
>> (2.7) is an “end-of-life release”.
> Doesn't matter.  We support both Python 2 and 3, and this script must
> do so as well, else it's not getting committed.  Any desupport for
> Python 2 in PG is very far away; no one has even suggested we consider
> it yet.
>
>             regards, tom lane
Well, this is not a problem at all. We just need to have two scripts:
one for Python 2 and one for Python 3 (it is not possible to have one
script compatible with both versions insofar as Python 3 is not
backward-compatible). Another possibility is to have just one script for
Python 2, with no support for Python 3.

Tell me what you prefer.

Léonard Benedetti



Re: BUG #13440: unaccent does not remove all diacritics

From
Peter Eisentraut
Date:
On 3/11/16 1:16 PM, Tom Lane wrote:
> Léonard Benedetti <benedetti@mlpo.fr> writes:
>> Despite all that, I think this transition to Python 3 is wise; it has
>> been available since 2008. Python 2 is legacy, and its last version
>> (2.7) is an “end-of-life release”.
>
> Doesn't matter.  We support both Python 2 and 3, and this script must
> do so as well, else it's not getting committed.  Any desupport for
> Python 2 in PG is very far away; no one has even suggested we consider
> it yet.

This script is only run occasionally when the unaccent data needs to be
updated from Unicode data, so it's not really that important what
language and version it's written in.  That said, the mentioned reason
for changing this to Python 3 is so that one can include Unicode
characters into the source text, which I find undesirable in general
(for PostgreSQL source code) and not very useful in this particular
case.  I think the script can be kept in Python 2 style.  Making it
upward compatible with Python 3 can be a separate (small) project.




Re: BUG #13440: unaccent does not remove all diacritics

From
Léonard Benedetti
Date:
12/03/2016 04:02, Peter Eisentraut wrote:
> On 3/11/16 1:16 PM, Tom Lane wrote:
>> Léonard Benedetti <benedetti@mlpo.fr> writes:
>>> Despite all that, I think this transition to Python 3 is wise; it has
>>> been available since 2008. Python 2 is legacy, and its last version
>>> (2.7) is an “end-of-life release”.
>> Doesn't matter.  We support both Python 2 and 3, and this script must
>> do so as well, else it's not getting committed.  Any desupport for
>> Python 2 in PG is very far away; no one has even suggested we consider
>> it yet.
> This script is only run occasionally when the unaccent data needs to be
> updated from Unicode data, so it's not really that important what
> language and version it's written in.  That said, the mentioned reason
> for changing this to Python 3 is so that one can include Unicode
> characters into the source text, which I find undesirable in general
> (for PostgreSQL source code) and not very useful in this particular
> case.  I think the script can be kept in Python 2 style.  Making it
> upward compatible with Python 3 can be a separate (small) project.
>
I completely agree. This script does not have to be run regularly (as
mentioned, just when the Unicode standard or the transliterator's
characters change). Moreover, even when that happens, users can wait for
the next version of PostgreSQL, where the rules file will already have
been updated. So it is indeed a one-time task, and the language of this
script is not so important.

However, concerning support for Unicode characters in the source code,
the version of Python does not change much (both versions support it).
The change to Python 3 was rather done to anticipate the end of life of
Python 2. But as Tom Lane has pointed out, that is not going to happen
soon (according to PEP 0373: “The current plan is to support [Python 2]
for at least 10 years from the initial 2.7 release. This means there
will be bugfix releases until 2020.”). Furthermore, as I stated above,
the adaptation to Python 3 was quite trivial and could easily be made in
due course.

So I think we can keep just the Python 2 version for now. If everyone
agrees, I'll update the files and the patch.

Léonard Benedetti



Re: BUG #13440: unaccent does not remove all diacritics

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On 3/11/16 1:16 PM, Tom Lane wrote:
>> Doesn't matter.  We support both Python 2 and 3, and this script must
>> do so as well, else it's not getting committed.

> This script is only run occasionally when the unaccent data needs to be
> updated from Unicode data, so it's not really that important what
> language and version it's written in.

Oh, okay, I'd supposed this would be part of the build process.

> That said, the mentioned reason
> for changing this to Python 3 is so that one can include Unicode
> characters in the source text, which I find undesirable in general
> (for PostgreSQL source code) and not very useful in this particular
> case.  I think the script can be kept in Python 2 style.

Works for me, and I agree that Unicode in our source text is mostly
a pain.

            regards, tom lane

Re: BUG #13440: unaccent does not remove all diacritics

From:
Teodor Sigaev
Date:
> So I think we can keep just a version for Python 2 for now. If everyone
> agrees, I'll update the files and patch.

The attached patch is my attempt to make the script work with both
Python 2 and 3. At least it produces the same result under 2.7 and 3.4.
Could you please check? I'm not a Python developer at all.

BTW, I removed the Unicode characters from the code and left them only in comments.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From:
Léonard Benedetti
Date:
15/03/2016 18:01, Teodor Sigaev wrote:
>> So I think we can keep just a version for Python 2 for now. If everyone
>> agrees, I'll update the files and patch.
>
> The attached patch is my attempt to make the script work with both
> Python 2 and 3. At least it produces the same result under 2.7 and
> 3.4. Could you please check? I'm not a Python developer at all.
>
> BTW, I removed the Unicode characters from the code and left them
> only in comments.
>
Unfortunately, this script is not functional: the characters handled by
“parse_cldr_latin_ascii_transliterator” are missing from the output. It
is probably a compatibility problem with the regex (the two versions of
the language are not fully compatible, and it is not always possible to
write code that works with both).
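
The 2-vs-3 incompatibilities in a script like this usually come down to byte strings versus text strings. A minimal sketch of the portable approach (the rule format and function name here are illustrative stand-ins, not the actual script's): reading the data file through `io.open` with an explicit encoding yields text strings under both Python 2 and 3, so the same Unicode-aware regex works in both.

```python
import io
import re

def parse_rules(path):
    # io.open with an explicit encoding returns text (unicode) strings
    # under both Python 2 and 3, unlike the bare open() builtin in 2.
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            # Hypothetical rule format: "<source> \u2192 <target> ;"
            # (\u2192 is RIGHTWARDS ARROW, written as an escape so the
            # script source itself stays ASCII-only).
            m = re.match(u"(\\S+) \u2192 (\\S+) ;", line)
            if m:
                yield m.group(1), m.group(2)
```

Under this pattern the regex never touches undecoded bytes, which sidesteps the behavioral differences between the two interpreters.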

Given the various feedback, and since the PostgreSQL source uses only
Python 2, the end of support for that version will not happen soon, and
above all this script needs to be run very rarely (only when the
Unicode Standard or the transliterator is updated; it is not part of
the build process), the easiest path seems to be a single Python 2 script.

So, you will find attached a new patch; it’s the same script, compatible
with Python 2, *with only ASCII characters*.

Regards.

Léonard Benedetti

Attachments

Re: BUG #13440: unaccent does not remove all diacritics

From:
Teodor Sigaev
Date:
> So, you will find attached a new patch; it’s the same script, compatible
> with Python 2, *with only ASCII characters*.
Thank you very much, pushed. I changed only one thing: the → character
in the regex is now written as \u2192.
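
That one change can be illustrated in isolation: the `\u2192` escape denotes the same character as a literal arrow, so the compiled pattern matches exactly the same input while the script source stays ASCII-only (the pattern below is a hypothetical stand-in, not the committed script's).

```python
import re

# "\u2192" is the ASCII-only spelling of RIGHTWARDS ARROW; using the
# escape keeps the script source free of non-ASCII characters while
# the compiled pattern still matches the literal arrow in the data.
pattern = re.compile(u"(\\S+) \u2192 (\\S+)")

# "\u021b \u2192 t" is the text "LATIN SMALL LETTER T WITH COMMA BELOW,
# arrow, t" -- the kind of mapping line the generator consumes.
m = pattern.match(u"\u021b \u2192 t")
```
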
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/