Обсуждение: Remove diacritical marks in SQL

Поиск
Список
Период
Сортировка

Remove diacritical marks in SQL

От
Jiří Němec
Дата:
Hello,

I would like to remove diacritical marks from a string in a SQL query.
I tried to convert a UTF8 string to ASCII but it doesn't work for me.

SELECT convert('ěščřžýáíé','UTF8','SQL_ASCII')

array(1) {
  ["convert"]=>
  string(18) "ěščřžýáíé"
}

Thanks for any advice,

J.N.


Re: Remove diacritical marks in SQL

От
Martijn van Oosterhout
Дата:
On Fri, Jan 12, 2007 at 10:16:22PM +0100, Ji?í N?mec wrote:
> Hello,
>
> I would like to remove diacritical marks from a string in a SQL query.
> I tried to convert a UTF8 string to ASCII but it doesn't work for me.
>
> SELECT convert('?????ýáíé','UTF8','SQL_ASCII')

I don't think postgres has any stuff builtin for that, but other
languages (like perl) have modules to do this kind of thing. The method
is to decompose the string to normal form D, strip the diacritics, and
recompose what's left.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: Remove diacritical marks in SQL

От
Michael Fuhr
Дата:
On Fri, Jan 12, 2007 at 10:58:36PM +0100, Martijn van Oosterhout wrote:
> On Fri, Jan 12, 2007 at 10:16:22PM +0100, Jiří Němec wrote:
> > I would like to remove diacritical marks from a string in a SQL query.
> > I tried to convert a UTF8 string to ASCII but it doesn't work for me.
> >
> > SELECT convert('ěščřžýáíé','UTF8','SQL_ASCII')
>
> I don't think postgres has any stuff builtin for that, but other
> languages (like perl) have modules to do this kind of thing. The method
> is to decompose the string to normal form D, strip the diacritics, and
> recompose what's left.

A technique that's been posted before might work:

SELECT to_ascii(convert('ěščřžýáíé', 'LATIN2'), 'LATIN2');
 to_ascii
-----------
 escrzyaie
(1 row)

to_ascii() supports only LATIN1, LATIN2, LATIN9, and WIN1250 so you
have to convert to one of those encodings first.

As Martijn suggested, you could use Perl.  Here's an example with
Text::Unaccent (you'll need to use encoding names that iconv
recognizes):

CREATE FUNCTION unaccent(charset text, string text) RETURNS text AS $$
use Text::Unaccent;
return unac_string($_[0], $_[1]);
$$ LANGUAGE plperlu IMMUTABLE STRICT;

SELECT unaccent('UTF-8', 'ěščřžýáíé');
 unaccent
-----------
 escrzyaie
(1 row)

Here's an example that uses Unicode::Normalize to strip non-spacing
and enclosing marks:

CREATE FUNCTION unaccent(string text) RETURNS text AS $$
use Unicode::Normalize;
my $nfd_string = NFD($_[0]);
$nfd_string =~ s/[\p{Mn}\p{Me}]//g;
return NFC($nfd_string);
$$ LANGUAGE plperlu IMMUTABLE STRICT;

SELECT unaccent('ěščřžýáíé');
 unaccent
-----------
 escrzyaie
(1 row)

--
Michael Fuhr