Re: english parser in text search: support for multiple words in the same position

From Sushant Sinha
Subject Re: english parser in text search: support for multiple words in the same position
Date
Msg-id AANLkTimOL8U68putCUUyhTnABbXu4pGpx740T4ENb-Wf@mail.gmail.com
In reply to Re: english parser in text search: support for multiple words in the same position  (Sushant Sinha <sushant354@gmail.com>)
Responses Re: english parser in text search: support for multiple words in the same position  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Any updates on this?<br /><br /><br /><div class="gmail_quote">On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <span
dir="ltr">&lt;<a href="mailto:sushant354@gmail.com">sushant354@gmail.com</a>&gt;</span> wrote:<br /><blockquote
class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left:
1ex;"><div class="im">> I looked at this patch a bit.  I'm fairly unhappy that it seems to be<br /> > inventing a
brand new mechanism to do something the ts parser can<br /> > already do.  Why didn't you code the url-part mechanism
using the<br /> > existing support for compound words?<br /><br /></div>I am not familiar with the compound word
implementation and so I am not sure<br /> how to split a url with compound word support. I looked into the<br />
documentation for compound words and that does not say much about how to<br /> identify components of a token. Is a
compound word split by matching<br /> against a list of words? If yes, then we will not be able to use that as we<br /> do
not know all the words that can appear in a url/host/email/file.<br /><br /> I think another approach can be to use the
dict_regex dictionary<br /> support. However, we will have to match the regex with something that<br /> the parser is
doing.<br /><br /> The current patch is not inventing any new mechanism. It uses the<br /> special handler mechanism
already present in the parser. For example,<br /> when the current parser finds a URL it runs a special handler
called<br /> SpecialFURL which resets the parser position to the start of the token to<br /> find the hostname. After finding
the host it moves to finding the path. So<br /> you first get the URL and then the host and finally the path.<br /><br
/>Similarly, we are resetting the parser to the start of the token on<br /> finding a url to output url parts. Then
before entering the state that<br /> can lead to a url we output the url part. The state machine modification<br /> is
similar for other tokens like file/email/host.<br /><div class="im"><br /><br /> > The changes made to parsetext()<br
/>> seem particularly scary: it's not clear at all that that's not breaking<br /> > unrelated behaviors.  In
fact, the changes in the regression test<br /> > results suggest strongly to me that it *is* breaking things.  Why
are<br /> > there so many diffs in examples that include no URLs at all?<br /> ><br /><br /></div>I think some of
the difference is coming from the fact that now pos<br /> starts with 0 and it used to be 1 earlier. That is easily
fixable<br /> though.<br /><div class="im"><br /> > An issue that's nearly as bad is the 100% lack of
documentation,<br /> > which makes the patch difficult to review because it's hard to tell<br /> > what it intends
to accomplish or whether it's met the intent.<br /> > The patch is not committable without documentation anyway, but
right<br /> > now I'm not sure it's even usefully reviewable.<br /><br /></div>I did not provide any explanation as I
could not find any place in the<br /> code to provide the documentation (that was just a modification of the state<br />
machine). Should I do a separate write-up to explain the desired output<br /> and the changes to achieve it?<br /><div
class="im"><br /> ><br /> > In line with the lack of documentation, I would say that the choice of<br /> > the
name "parttoken" for the new token type is not helpful.  Part of<br /> > what?  And none of the other token type
names include the word "token",<br /> > so that's not a good decision either.  Possibly "url_part" would be a<br />
> suitable name.<br /> ><br /><br /></div>I can modify it to output url-part/host-part/email-part/file-part if<br
/> there is an agreement over the rest of the issues. So let me know if I<br /> should go ahead with this.<br /><font
color="#888888"><br/> -Sushant.<br /><br /></font></blockquote></div><br /> 
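The token ordering described in the quoted message (the whole URL first, then the host, then the path, each found by rescanning from the start of the same token) can be illustrated with a small sketch. This is not the PostgreSQL parser code: the function name `emit_url_tokens` is hypothetical, the input is assumed to be a token already recognized as a URL with no protocol prefix, and the "reset to the start of the token" step is modeled simply by scanning the same string again.

```python
def emit_url_tokens(token):
    """Return (type, text) pairs in the order described in the message.

    Hypothetical illustration only: the real parser is a C state machine
    and the token type names here mirror the default parser's
    url / host / url_path types.
    """
    # 1. The whole token is emitted first as a URL.
    tokens = [("url", token)]

    # 2. "Reset" to the start of the token and rescan for the host:
    #    everything up to the first slash.
    slash = token.find("/")
    host = token if slash == -1 else token[:slash]
    tokens.append(("host", host))

    # 3. Continue from where the host ended to emit the path, if any.
    if slash != -1:
        tokens.append(("url_path", token[slash:]))

    return tokens


for kind, text in emit_url_tokens("www.postgresql.org/docs/"):
    print(kind, text)
```

The patch under discussion would, as the message describes it, emit additional url-part tokens during the same rescan; how those parts are delimited is not stated in the message, so the sketch leaves them out.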
