Обсуждение: Full text index without accents
Hi again: I am trying to create a full text configuration to ignore word accents in my searches. My approach is similar to simple dicionary one, but i want to remove accents after converting to lower. Is it the only way to do it to develop another .c and write my own dict_noaccent.c, and then compile and install it into the system? Regars, Mario Barcala
You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery
Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
> Hi again:
>
> I am trying to create a full text configuration to ignore word accents in
> my searches. My approach is similar to simple dicionary one, but i want to
> remove accents after converting to lower.
>
> Is it the only way to do it to develop another .c and write my own
> dict_noaccent.c, and then compile and install it into the system?
>
> Regars,
>
> Mario Barcala
>
>
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
And which are the types of argument and returning values of a pl/sql
function which preprocess de text?
I have been searching that, for example, something like this works fine:
CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));
where text is the text column of document. But I have tried to do
something like:
CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));
where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.
I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.
Thanks,
Mario Barcala
> You can preprocess text (replace accent by nothing) before
> to_tsvector or to_tsquery
>
>
>
> Oleg
> On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
>
> > Hi again:
> >
> > I am trying to create a full text configuration to ignore word accents in
> > my searches. My approach is similar to simple dicionary one, but i want to
> > remove accents after converting to lower.
> >
> > Is it the only way to do it to develop another .c and write my own
> > dict_noaccent.c, and then compile and install it into the system?
> >
> > Regars,
> >
> > Mario Barcala
> >
> >
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
Here is an example
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;
arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2
On Tue, 22 Jul 2008, Fco. Mario Barcala Rodr?guez wrote:
> And which are the types of argument and returning values of a pl/sql
> function which preprocess de text?
>
> I have been searching that, for example, something like this works fine:
>
> CREATE INDEX textindex ON document USING
> gin(to_tsvector('english',upper(text)));
>
> where text is the text column of document. But I have tried to do
> something like:
>
> CREATE INDEX textindex ON document USING
> gin(to_tsvector('english',myfunction(text)));
>
> where myfunction is a PL/SQL function which call upper one, but I didn't
> find which are the types of the myfunction argument and returning value.
>
> I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
> I will have to change upper experiment to my objective: to index without
> accents. I don't know if PL/SQL is the better option to build such
> function.
>
> Thanks,
>
> Mario Barcala
>
>> You can preprocess text (replace accent by nothing) before
>> to_tsvector or to_tsquery
>>
>>
>>
>> Oleg
>> On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
>>
>>> Hi again:
>>>
>>> I am trying to create a full text configuration to ignore word accents in
>>> my searches. My approach is similar to simple dicionary one, but i want to
>>> remove accents after converting to lower.
>>>
>>> Is it the only way to do it to develop another .c and write my own
>>> dict_noaccent.c, and then compile and install it into the system?
>>>
>>> Regars,
>>>
>>> Mario Barcala
>>>
>>>
>>>
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
>
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Finally I create a function like:
CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;
Then, create text indexes, one for sensible queries and other for
unsensible ones:
CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));
CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));
And then make a query sensible or unsensible to accents doing:
SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');
or:
SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.
I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.
Thank you very much,
Mario Barcala
> Here is an example
>
> CREATE FUNCTION dropatsymbol(text) RETURNS text
> AS 'select replace($1, ''@'', '' '');'
> LANGUAGE SQL;
>
> arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
> to_tsvector
> -------------------------
> 'oleg':1 'sai.msu.su':2
This would probably help:
CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
RETURNS character varying AS
$BODY$
declare
p_str alias for $1;
v_str varchar;
begin
select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str;
select translate(v_str, 'ÉÈËÊ', 'EEEE') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÒÓÔÕÖ', 'OOOOO') into v_str;
select translate(v_str, 'ÙÚÛÜ', 'UUUU') into v_str;
select translate(v_str, 'àáâãäå', 'aaaaaa') into v_str;
select translate(v_str, 'èéêë', 'eeee') into v_str;
select translate(v_str, 'ìíîï', 'iiii') into v_str;
select translate(v_str, 'òóôõö', 'ooooo') into v_str;
select translate(v_str, 'ùúûü', 'uuuu') into v_str;
select translate(v_str, 'Çç', 'Cc') into v_str;
return v_str;
end;$BODY$
LANGUAGE 'plpgsql' VOLATILE;
There's also o useful functions here:
http://www.project-open.org/doc/intranet-search-pg/intranet-search-pg-create.sql
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Fco. Mario Barcala
Rodríguez
Sent: July 24, 2008 4:47 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents
Finally I create a function like:
CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;
Then, create text indexes, one for sensible queries and other for
unsensible ones:
CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));
CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));
And then make a query sensible or unsensible to accents doing:
SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');
or:
SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.
I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.
Thank you very much,
Mario Barcala
> Here is an example
>
> CREATE FUNCTION dropatsymbol(text) RETURNS text
> AS 'select replace($1, ''@'', '' '');'
> LANGUAGE SQL;
>
> arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
> to_tsvector
> -------------------------
> 'oleg':1 'sai.msu.su':2
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Jonathan Bond-Caron wrote: > This would probably help: > > CREATE OR REPLACE FUNCTION norm_text_latin(character varying) > RETURNS character varying AS > $BODY$ > declare > p_str alias for $1; > v_str varchar; > begin > select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str; Hmm, why not simply use to_ascii() ? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, why not simply use to_ascii() ?
The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.
regards, tom lane
Ya the function name norm_text_latin() was probably misleading, it takes
latin1-ish characters *encoded in UTF8* and brings them to ascii.
Definitely, the following would be much simpler:
SELECT to_ascii('ÀÁÂÃÄÅÒÓÔÕÖ', 'UTF8')
As of 8.3, you have to do some magic with to_ascii() and utf8 characters
SELECT to_ascii(convert_to_latin('ÀÁÂÃÄÅÒÓÔÕÖ'), 'LATIN1')
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: July 25, 2008 10:42 AM
To: Alvaro Herrera
Cc: Jonathan Bond-Caron; 'Fco. Mario Barcala Rodríguez';
pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, why not simply use to_ascii() ?
The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general