Re: english parser in text search: support for multiple words in the same position
From | Sushant Sinha
Subject | Re: english parser in text search: support for multiple words in the same position
Date |
Msg-id | 1283323324.2084.22.camel@dragflick
In response to | Re: english parser in text search: support for multiple words in the same position (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: english parser in text search: support for multiple words in the same position
List | pgsql-hackers
I have attached a patch that emits parts of a host token, a url token, an email token, and a file token. Further, it makes sure that a host/url/email/file token and its first part-token are at the same position in the tsvector.

The two major changes are:

1. Tokenization changes: The patch uses the special handlers in the text parser to reset the parser position to the start of a host/url/email/file token when it finds one. Special handlers were already used for extracting the host and urlpath from a full url, so this is an extension of the same idea.

2. Position changes: We do not advance the position when we encounter a host/url/email/file token. As a result, the first part of that token aligns with the token itself.

Attachments:
tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch against CVS head

Currently, the patch outputs the parts of these tokens as normal tokens such as WORD, NUMWORD, etc. Tom argued earlier that this will break backward compatibility, and that the parts should instead be output as parts of the respective tokens. If there is agreement with Tom's position, the current patch can be modified to output subtokens as parts. However, before I complicate the patch with that, I wanted to get feedback on any other major problems with it.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
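To make the position change concrete, here is a sketch of the before/after behavior described above. The exact output is illustrative (taken from the description, not from the attached tokens_output.txt): the key point is that the first part-token shares the host token's position, so a search for just "wikipedia" matches.

```sql
-- Illustrative only: tsvector output for a bare hostname.
SELECT to_tsvector('english', 'wikipedia.org');

-- Unpatched parser: only the whole host token is indexed,
-- so a query for 'wikipedia' alone finds nothing:
--   'wikipedia.org':1

-- With the patch: the parts are emitted as well, and the first
-- part ('wikipedia') is placed at the same position as the host:
--   'org':2 'wikipedia':1 'wikipedia.org':1
```

Keeping the host and its first part at the same position also preserves phrase-distance semantics for queries that span the hostname.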
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>	host		wikipedia.org
>	host-part	wikipedia
>	host-part	org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
> 			regards, tom lane
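Tom's point about letting the user decide maps onto the existing configuration machinery: token types are routed to dictionaries per configuration, and an unmapped token type is simply discarded. A sketch, assuming a new "host_part" token type were added (no such type exists in the stock parser; the name and the configuration name my_config are hypothetical):

```sql
-- Index hostname components by sending the hypothetical host_part
-- token type through the simple dictionary:
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR host_part WITH simple;

-- Or discard hostname components entirely by removing the mapping;
-- tokens with no mapping are dropped from the tsvector:
ALTER TEXT SEARCH CONFIGURATION my_config
    DROP MAPPING FOR host_part;
```

This is the same mechanism users already rely on to, for example, stop indexing url_path tokens, which is why emitting parts under a distinct token type is more flexible than reclassifying them as plain words.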