Thanks, everyone. I've been comparing unaccent's behavior with that of
https://github.com/andrewrk/node-diacritics/blob/master/index.js, in case
that helps.
On Monday, June 15, 2015, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> My terminal shows these characters to be different. One is
> >> http://graphemica.com/%C8%9B
> >> latin small letter t with comma below (U+021B)
> >
> >> The other is
> >> http://graphemica.com/%C5%A3
> >> latin small letter t with cedilla (U+0163)
> >
> > Ah-hah -- I did not look closely enough. So the immediate answer for
> > Michael is to add another entry to his unaccent.rules file.
> >
> > Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma. Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
> 702
>
> That's quite a lot more than the 187 we currently have. Of those, I
> think only the following ligature characters don't fit the above
> pattern: Æ, æ, Ĳ, ĳ, Œ, œ, ß. Incidentally, I don't believe that the
> way we "unaccent" ligatures is correct anyway. Maybe they should be
> expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
> O, o, S as we have it, but I guess it depends what the purpose of
> unaccent is...
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
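In case it is useful for the discussion above, here is a minimal Python
sketch of how candidate rules could be derived rather than listed by hand.
It is only an illustration, not the node-diacritics or PostgreSQL code: the
names unaccent_char and rule_lines are hypothetical, the ligature
expansions are the ones Thomas suggests, and the tab-separated output is my
assumption about what an unaccent.rules entry looks like. It relies on
Unicode canonical decomposition (NFD), so letters without one, such as
ø (WITH STROKE), would still need explicit entries.

```python
import unicodedata

# Ligatures are handled by an explicit table rather than by decomposition.
# These expansions follow the suggestion in the quoted message.
LIGATURES = {
    "Æ": "AE", "æ": "ae", "Ĳ": "IJ", "ĳ": "ij",
    "Œ": "OE", "œ": "oe", "ß": "ss",
}


def unaccent_char(ch):
    """Return a plain ASCII replacement for ch, or None if no rule applies."""
    if ch in LIGATURES:
        return LIGATURES[ch]
    # NFD splits a precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base != ch and base.isascii() and base.isalpha():
        return base
    return None


def rule_lines(chars):
    """Yield 'source<TAB>target' lines (assumed unaccent.rules-style layout)."""
    for ch in chars:
        target = unaccent_char(ch)
        if target is not None:
            yield f"{ch}\t{target}"
```

With this, both U+021B (t with comma below) and U+0163 (t with cedilla)
map to "t", and U+0219 (s with comma below) maps to "s", e.g. via
print("\n".join(rule_lines("\u021b\u0163\u0219"))).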
-- 
Cheers,
Mike
-- 
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike@busbud.com