Re: english parser in text search: support for multiple words in the same position

From Sushant Sinha
Subject Re: english parser in text search: support for multiple words in the same position
Date
Msg-id 1280754770.1769.8.camel@dragflick
In response to Re: english parser in text search: support for multiple words in the same position  (Markus Wanner <markus@bluegap.ch>)
Responses Re: english parser in text search: support for multiple words in the same position  (Markus Wanner <markus@bluegap.ch>)
Re: english parser in text search: support for multiple words in the same position  (Robert Haas <robertmhaas@gmail.com>)
Re: english parser in text search: support for multiple words in the same position  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
> 
> This would needlessly increase the number of tokens. Instead you'd 
> better make it work like compound word support, having just "wikipedia" 
> and "org" as tokens.

The current text parser already returns url and url_path tokens, which
already increases the number of unique tokens. I am only asking to add
the normal English words as well, so that someone who types only
"wikipedia" gets a match.

> 
> Searching for "wikipedia.org" or "wikipedia org" should then result in 
> the same search query with the two tokens: "wikipedia" and "org".

People have previously expressed the need to index URLs and emails, and
the text parser already does so. Reverting that would be a regression
in functionality. Further, a ranking function can take advantage of a
direct match on a token.

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
> 
> IMO the differentiation between WORDs and URLs is not something the text 
> search engine should have to take care a lot. Let it just do the 
> searching and make it do that well.

The Postgres english parser already emits URLs as tokens. All I am
asking for is an improvement in the tokenization and positioning.
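
As a rough illustration of the positioning being asked for, here is a toy
sketch in Python. It is not the Postgres parser: the function name
`tokenize_with_positions`, the regex, and the choice to emit only the first
host label as a word are all assumptions made for this example.

```python
import re

# Hypothetical toy tokenizer, NOT the Postgres parser. It only illustrates
# emitting more than one token at the same position.
URL_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+(/\S*)?$", re.IGNORECASE)


def tokenize_with_positions(text):
    """Return (position, type, token) tuples for whitespace-separated input."""
    out = []
    for pos, raw in enumerate(text.split()):
        if URL_RE.match(raw):
            # Emit the first host label as a plain word ...
            out.append((pos, "WORD", raw.split(".", 1)[0].lower()))
            # ... and keep the whole URL as a token at the same position.
            out.append((pos, "URL", raw))
        else:
            out.append((pos, "WORD", raw.lower()))
    return out


for entry in tokenize_with_positions("wikipedia.org/search?q=sushant rocks"):
    print(entry)
```

With this layout a query for just "wikipedia" can match position 0, while the
full URL token remains available for an exact match.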

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of 
> text searching? Or even result highlighting? I wouldn't expect anybody 
> to want to search for a full URL, do you?

That need has been expressed in the past. An exact token match can
also lead to better ranking. For example, a tf-idf ranking will rank a
match on such a unique token significantly higher.
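
To make the tf-idf point concrete, here is a minimal sketch in Python. It is
a generic smoothed tf-idf, not a Postgres ranking function; the helper names
`idf` and `tf_idf` and the three tiny documents are made up for illustration.

```python
import math


def idf(token, docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    df = sum(1 for d in docs if token in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1


def tf_idf(token, doc, docs):
    """Raw count of the token in `doc` times its idf over `docs`."""
    return doc.count(token) * idf(token, docs)


# Hypothetical corpus: every document contains "wikipedia",
# only the first contains the full URL token.
docs = [
    ["wikipedia", "wikipedia.org/search?q=sushant", "search"],
    ["wikipedia", "article", "search"],
    ["wikipedia", "org", "search"],
]

url_score = tf_idf("wikipedia.org/search?q=sushant", docs[0], docs)
word_score = tf_idf("wikipedia", docs[0], docs)
print(url_score > word_score)
```

The URL token appears in one document out of three, so its idf is larger than
that of "wikipedia", which appears in all of them.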

-Sushant.

> Regards
> 
> Markus Wanner



