Thread: full-text search question

full-text search question

From
Sabbiolina
Date:

Hello,

I've seen that the default full-text search parser can identify e-mail addresses, hosts, and URLs, but I have a serious problem with it:

Suppose I index the following sentence: "the search engine I use the most is www.google.com"

If I search for "google", no result is found.

If I instead search for "www.google.com", the record is found correctly.

I guess the reason is that the parser treats www.google.com as a single token (of type 'host'), but this is clearly a major problem. The word "google" really is in the sentence above, and the end user of the database naturally asks me: "Why does your FTS not find that record when I can clearly see that my search term is there?"

Reading the docs, I've seen that the parser can produce multiple tokens for the same word (for example, "make-up" produces 4 tokens: make-up, make, -, up). Why not do the same with URLs and e-mail addresses? Why is www.google.com treated only as a single word? Why not produce multiple tokens such as www.google.com, www, ., google, ., com? (Obviously www and . can be nulled or stopworded.)

Does anybody know of a better parser for Postgres? Or at least a trick to make its FTS find the record above when searching for only part of the URL?

Re: full-text search question

From
Oleg Bartunov
Date:
Sabbiolina,

you have two options:

1. Write your very own parser
2. Write a dictionary that breaks a host into parts

Fortunately, for option 2 you can use our dict_regex dictionary
(http://vo.astronet.ru/arxiv/dict_regex.html) instead of writing one.
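
To make option 2 concrete, here is a minimal Python sketch of the mapping such a dictionary would perform on a 'host' token. The function name split_host is hypothetical, and this is not dict_regex's interface or code; it only shows the kind of output one would want: the full host plus each of its parts.

```python
from typing import List


def split_host(host: str) -> List[str]:
    """Return the full host followed by its individual labels.

    Hypothetical helper for illustration only; it is not part of
    PostgreSQL or dict_regex.
    """
    # Normalize case and drop a trailing root dot, if any.
    host = host.lower().rstrip(".")
    labels = [label for label in host.split(".") if label]
    # Keep the whole host first so exact searches still match, then add
    # each label. The "." separators themselves are never emitted.
    return [host] + labels if len(labels) > 1 else [host]


print(split_host("www.google.com"))
# ['www.google.com', 'www', 'google', 'com']
```

With output like this, a search for "google" could match the indexed sentence, while "www" and "com" could still be dropped as stop words.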

Oleg


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: full-text search question

From
Andrew Sullivan
Date:
On Wed, Jun 18, 2008 at 02:49:48PM +0200, Sabbiolina wrote:
> www.google.com is only treated as a unique word? Why not producing multiple
> tokens like www.google.com, www, ., google, ., com? (obviously www and . can
> be nulled or stopworded).

You wouldn't want to get the token ".".  It's not a token, but a label
boundary.  So in your analogy of treating the labels in a FQDN as
"words", the "." needs to be treated the way spaces are between words.
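
As a rough illustration of treating "." the way spaces are treated, the text could be preprocessed before it is indexed, so that each label also appears as an ordinary word. This is only a sketch in Python: the pattern HOST_RE and the function expand_hosts are assumptions made up for this example, and the pattern is much cruder than the real parser's host rules (it would also expand things like version numbers).

```python
import re

# Rough pattern for host-like tokens: two or more labels joined by dots.
# An assumption for illustration; the PostgreSQL parser's own rules differ.
HOST_RE = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def expand_hosts(text: str) -> str:
    """Append the dot-free form of each host so its labels become plain words."""
    def expand(match):
        host = match.group(0)
        # Keep the original host, then add the same labels separated by
        # spaces, so no "." token is ever produced.
        return host + " " + host.replace(".", " ")
    return HOST_RE.sub(expand, text)


print(expand_hosts("the search engine I use the most is www.google.com"))
# the search engine I use the most is www.google.com www google com
```

The expanded string would be used only to build the search index; the original text stays stored as it is.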

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/