Thread: to_ascii, or some other form of magic transliteration

to_ascii, or some other form of magic transliteration

From
Ben
Date:
I'm working on a problem that I imagine others have had, which basically
boils down to having nice unicode display text that users are going to
want to search against without typing it correctly.... e.g. let a search
for "sma" match "små". It seems like the best way to do this is to find
a magic unicode transliteration mapping function, and then save the
ASCII transliterations for searching against.

I see there's a function to_ascii, which sounds hopeful. However, when I
try to use it, I get back:

ERROR:  encoding conversion from UNICODE to ASCII not supported

What is this function for, if not to convert other encodings to ASCII?
Is there some other way to do what I'm asking for?

Re: to_ascii, or some other form of magic transliteration

From
Mike Rylander
Date:
On 9/9/05, Ben <bench@silentmedia.com> wrote:
> I'm working on a problem that I imagine others have had, which basically
> boils down to having nice unicode display text that users are going to
> want to search against without typing it correctly.... e.g. let a search
> for "sma" match "små". It seems like the best way to do this is to find
> a magic unicode transliteration mapping function, and then save the
> ASCII transliterations for searching against.
>

The simplest solution I've found is to maintain a
separate column holding an ASCII-ized version of your text.  The conversion
can be done automatically by a trigger, and I have one in PL/PerlU
that I use.  It basically boils down to:

1) transform the Unicode text to normalization form D
2) strip the combining (non-spacing) marks

In modern Perls that looks like:

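#--------------
use Unicode::Normalize;
# Decompose to normalization form D: base letter followed by its marks
my $txt = NFD(shift());
# Remove every combining mark (Unicode general category M)
$txt =~ s/\pM//og;
return $txt;
#--------------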

Hope that helps!



--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: to_ascii, or some other form of magic transliteration

From
Ben
Date:
Hrm, I must be missing something, because I don't see how this will
transliterate to ASCII?

On Sep 10, 2005, at 5:30 AM, Mike Rylander wrote:

> The simplest solution to this that I've found is to maintain a
> separate column for ASCII-ized version of your text.  The conversion
> can be done automatically using a trigger, and I have one in PL/PERLU
> that I use.  It basically boils down to:
>
> 1) transform unicode text to normal form D
> 2) strip combining non-spacing marks
>
> In modern Perls that looks like:
>
> #--------------
> use Unicode::Normalize;
> my $txt = NFD(shift());
> $txt =~ s/\pM//og;
> return $txt;
> #--------------
>
> Hope that helps!
>
>

Re: to_ascii, or some other form of magic transliteration

From
Mike Rylander
Date:
On 9/10/05, Ben <bench@silentmedia.com> wrote:
> Hrm, I must be missing something, because I don't see how this will
> transliterate to ASCII?

If you want non-Western text to be Romanized, take a look at
Text::Unidecode(1).  The chunk of Perl I sent before only strips
non-spacing marks (accents, rings, umlauts and such).  You may need to
strip other character classes as well if you've got Unicode punctuation
codepoints in the text to be searched.
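
As a rough sketch of that last point, the snippet below strips the
combining marks and then also drops any punctuation that lies outside
ASCII.  The helper name search_key() is made up for illustration, and
deleting the punctuation (rather than mapping it to an ASCII
equivalent) is an assumption that may not suit every data set.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# search_key() is a hypothetical helper name, not part of any module.
sub search_key {
    my ($txt) = @_;
    $txt = NFD($txt);                   # split letters from their marks
    $txt =~ s/\pM//g;                   # strip combining marks
    $txt =~ s/(?![\x00-\x7F])\p{P}//g;  # strip punctuation outside ASCII
    return $txt;
}

# U+201C and U+201D are curly double quotes, U+00E5 is "å".
print search_key("\x{201C}sm\x{E5}\x{201D}"), "\n";   # prints: sma
```

ASCII punctuation is left alone, because the negative lookahead only
lets the match proceed on characters above 127.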

For the example you gave, the process is to decompose the character
"å" into normalization form D, which gives "a" followed by the Unicode
non-spacing mark for the ring, and then to remove that mark (the ring
diacritic) with the regex s/\pM//sog.  That leaves just the ASCII "a" in
the text, and for input like this the text can then be treated as pure
ASCII, because it no longer contains any codepoints with an ord() above
127.  You may want to look here(2) for an explanation and examples of
Unicode normalization forms.
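
To make the decomposition visible, here is a small standalone sketch
(outside PL/PerlU) that prints the codepoints of "små" after NFD and
then the stripped result.  The sub name strip_marks() is only an
illustrative choice, and the input is written with an \x{...} escape so
that the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical name; the body is the two steps above.
sub strip_marks {
    my ($txt) = @_;
    $txt = NFD($txt);      # "å" (U+00E5) becomes "a" followed by U+030A
    $txt =~ s/\pM//g;      # drop every combining mark
    return $txt;
}

my $word = "sm\x{E5}";     # "små"

# Show the decomposed codepoints: U+0073 U+006D U+0061 U+030A
print join(' ', map { sprintf 'U+%04X', ord } split //, NFD($word)), "\n";

# Show the search form: sma
print strip_marks($word), "\n";
```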

If you don't need that much functionality (handling arbitrary Unicode
text), and you're dealing strictly with the Latin-1 subset of Unicode,
you can just create a mapping table or hash to transliterate down to
ASCII, as done here(3).
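
A minimal sketch of the hash approach follows.  The table is
deliberately incomplete and the name latin1_to_ascii() is invented for
this example; it is not taken from the file referenced in (3).  One
reason to consider a table is that characters such as "ø", "ß" and "æ"
have no canonical decomposition, so the NFD approach leaves them
untouched.

```perl
use strict;
use warnings;

# Illustrative, incomplete table; extend it to cover your own data.
my %to_ascii = (
    "\x{E5}" => 'a',    # a with ring
    "\x{E4}" => 'a',    # a with diaeresis
    "\x{F6}" => 'o',    # o with diaeresis
    "\x{F8}" => 'o',    # o with stroke
    "\x{E9}" => 'e',    # e with acute
    "\x{FC}" => 'u',    # u with diaeresis
    "\x{DF}" => 'ss',   # sharp s
    "\x{E6}" => 'ae',   # ae ligature
);

sub latin1_to_ascii {
    my ($txt) = @_;
    # Replace each non-ASCII character that has a table entry;
    # characters without an entry are left as they are.
    $txt =~ s/([^\x00-\x7F])/exists $to_ascii{$1} ? $to_ascii{$1} : $1/ge;
    return $txt;
}

print latin1_to_ascii("sm\x{E5} br\x{F8}d"), "\n";   # prints: sma brod
```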



1) http://cpan.uwinnipeg.ca/htdocs/Text-Unidecode/Text/Unidecode.html
2) http://www.unicode.org/unicode/reports/tr15/#Canonical_Composition_Examples
3) http://www.eprints.org/files/eprints2/eprints-2.2/defaultcfg/ArchiveTextIndexingConfig.pm


