Re: tsearch2 keep throw-away characters

From Ivan Zolotukhin
Subject Re: tsearch2 keep throw-away characters
Date
Msg-id 751e56400705192234t33abf55s44e2f3aa7c6746ac@mail.gmail.com
In reply to tsearch2 keep throw-away characters  (Kimball <kbighorse@gmail.com>)
List pgsql-general
Hello,

Your problem is not about stop words. The tsearch parser treats the
'+' and '#' symbols as tokens of the blank type (use the ts_debug()
function to see this) and drops them without any further processing.
AFAIK, the typical solution is to rewrite both your text and your
queries to use auxiliary words, like 'SYScpp' and 'SYScsharp', which
will be included in tsvectors and indexed without any problems.
Usually you can do the replacement in the tsvector trigger when
indexing documents, and via query rewriting (in tsearch or in your
application) when querying the database.

Trivial examples:

test=# select to_tsvector('english','I know how to code in SYScsharp, java and SYScpp');
                     to_tsvector
------------------------------------------------------
 'code':5 'java':8 'know':2 'syscpp':10 'syscsharp':7
(1 row)

and, sure enough, the match works:

test=# select 'I know how to code in SYScsharp, java and SYScpp' @@ 'SYScpp';
 ?column?
----------
 t
(1 row)
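
If the rewriting is done in the application rather than in a trigger,
it could look roughly like the sketch below. This is only an
illustration: the REPLACEMENTS table and the rewrite() helper are
hypothetical names, not part of tsearch, and the same function has to
be applied both to documents before indexing and to user queries
before searching, otherwise the two sides will not match.

```python
import re

# Hypothetical mapping from terms the parser would mangle to
# placeholder words; the placeholders follow the 'SYScpp' and
# 'SYScsharp' examples used in this message.
REPLACEMENTS = {
    "c++": "SYScpp",
    "c#": "SYScsharp",
}

# Match a term only when it is not preceded by a word character,
# so the 'c#' inside 'abc#' is left alone.
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in REPLACEMENTS) + r")",
    re.IGNORECASE,
)


def rewrite(text: str) -> str:
    """Replace each known term with its placeholder word."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)


# Rewrite the document before passing it to to_tsvector() ...
print(rewrite("I know how to code in C#, java and C++"))
# prints: I know how to code in SYScsharp, java and SYScpp

# ... and rewrite the query the same way before searching.
print(rewrite("C++"))
# prints: SYScpp
```

The rewritten strings can then be sent to the database as ordinary
parameters, so the stored tsvector and the query both contain
'syscpp' and 'syscsharp'.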

There might be a more sophisticated solution, such as preventing the
parser from treating '++' as a blank token, but Oleg will explain this
much better as soon as he has time.

--
Regards,
Ivan


On 5/16/07, Kimball <kbighorse@gmail.com> wrote:
>
> postgres=# select to_tsvector('default','I know how to code in C#, java and
> C++');
>               to_tsvector
> -------------------------------------
>  'c':7,10 'code':5 'java':8 'know':2
> (1 row)
>
> postgres=# select to_tsvector('simple','I know how to code in C#, java and
> C++');
>                                to_tsvector
> -------------------------------------------------------------------------
>  'c':7,10 'i':1 'in':6 'to':4 'and':9 'how':3 'code':5 'java':8 'know':2
> (1 row)
>
>
> I'd like to get lexemes/tokens 'c#' and 'c++' out of this query.  Everything
> I can find has to do with stop words.   How do I keep characters that
> tsearch throws out?  I've already tried 'c\#' and 'c\\#' etc, which don't
> work.
>
> Kimball
