Re: english parser in text search: support for multiple words in the same position

From Sushant Sinha
Subject Re: english parser in text search: support for multiple words in the same position
Date
Msg-id 1283323324.2084.22.camel@dragflick
In response to Re: english parser in text search: support for multiple words in the same position  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: english parser in text search: support for multiple words in the same position  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
I have attached a patch that emits the parts of a host token, a url
token, an email token and a file token, in addition to the token itself.
Further, it makes sure that a host/url/email/file token and its first
part-token are at the same position in the tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one. Special handlers were
already used for extracting host and urlpath from a full url. So this is
more of an extension of the same idea.

2. Position changes: We do not advance position when we encounter a
host/url/email/file token. As a result the first part of that token
aligns with the token itself.
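
To illustrate the position rule in (2), here is a minimal sketch in
Python. It is only a model of the idea, not the C code in the patch:
the function name assign_positions, the COMPOUND_TYPES set and the
simplified "word" token type are assumptions made for this example.

```python
# Token types whose parts are emitted after the whole token.
# (Simplified model; not the parser's real data structures.)
COMPOUND_TYPES = {"host", "url", "email", "file"}


def assign_positions(tokens):
    """Assign tsvector-style positions to a stream of (type, text) tokens.

    A compound token does not advance the position counter, so the
    first part emitted after it lands on the same position.
    """
    result = []
    last_pos = 0
    for tok_type, text in tokens:
        if tok_type in COMPOUND_TYPES:
            # Take the position the next token will get, without advancing.
            result.append((text, last_pos + 1))
        else:
            last_pos += 1
            result.append((text, last_pos))
    return result


# Hypothetical token stream for the text "see wikipedia.org today".
stream = [
    ("word", "see"),
    ("host", "wikipedia.org"),
    ("word", "wikipedia"),
    ("word", "org"),
    ("word", "today"),
]
positions = assign_positions(stream)
```

In this model "wikipedia.org" and "wikipedia" both end up at position 2,
"org" at 3 and "today" at 4.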

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch:    patch against CVS HEAD

Currently, the patch outputs the parts of the tokens as normal tokens
such as WORD, NUMWORD, etc. Tom argued earlier that this will break
backward compatibility, and that the parts should instead be output as
parts of the respective tokens. If there is agreement on Tom's point,
then the current patch can be modified to output subtokens as parts.
However, before I complicate the patch with that, I wanted to get
feedback on any other major problems with the patch.
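
For comparison, the "subtokens as parts" alternative could be modeled
roughly as follows. This is a sketch in Python, not parser code: the
"host-part" type name is taken from Tom's message quoted below and does
not exist in the current parser, and split_host and index_terms are
hypothetical helpers used only for illustration.

```python
def split_host(host):
    """Emit the whole host token followed by its dot-separated parts."""
    tokens = [("host", host)]
    tokens += [("host-part", part) for part in host.split(".") if part]
    return tokens


def index_terms(tokens, mapped_types):
    """Keep only tokens whose type is mapped to a dictionary.

    Unmapped token types are discarded, which is how a configuration
    could opt out of indexing host parts.
    """
    return [text for tok_type, text in tokens if tok_type in mapped_types]


tokens = split_host("wikipedia.org")

# Configuration that ignores the new part type: same terms as today.
old_behaviour = index_terms(tokens, {"host"})

# Configuration that also maps host-part: the parts become searchable.
with_parts = index_terms(tokens, {"host", "host-part"})
```

Because the parts carry their own token type, a configuration that does
not map that type keeps indexing exactly what it indexes now, which is
the backward-compatibility argument.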

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>     host        wikipedia.org
>     host-part    wikipedia
>     host-part    org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
>             regards, tom lane

