Re: [GENERAL] Fragments in tsearch2 headline
| From | Sushant Sinha |
|---|---|
| Subject | Re: [GENERAL] Fragments in tsearch2 headline |
| Date | |
| Msg-id | 1212462829.8047.38.camel@dragflick |
| In response to | Re: [GENERAL] Fragments in tsearch2 headline (Teodor Sigaev <teodor@sigaev.ru>) |
| Responses | Re: [GENERAL] Fragments in tsearch2 headline |
| List | pgsql-hackers |
Efficiency: I realized that we do not need to store all norms. We only need to store the norms that are in the query. So I moved the addition of norms from addHLParsedLex to hlfinditem. This should add very little memory overhead to the existing headline generation. If this is still not acceptable for default headline generation, then I can push it into mark_hl_fragments. But I think any headline marking function will benefit from having the norms corresponding to the query.

Why do we need norms? hlCover does exactly what Cover in tsrank does, which is to find the cover that contains the query. However, hlCover has to go through words that do not match the query. Cover, on the other hand, operates on position indexes for just the query words, so it should be faster.

The main reason I would like it to be fast is that I want to generate all covers for a given query, then choose the covers with the smallest length, as they will be the ones that best explain the relation of the query to the document, and finally stretch those covers to the specified size.

In my understanding, the current headline generation tries to find the biggest cover for display in the headline. I personally think that such a cover does not explain the context of a query in a document. We may differ on this, and that's why we may need both options.

Let me know what you think of this patch, and I will update it to respect other options like MinWords and ShortWord.

NumFragments < 2: I wanted people to use the new headline marker if they specify NumFragments >= 1. If they do not specify NumFragments, or set it to 0, then the default marker will be used. This becomes a bit of a tricky parameter, so please put in any ideas on how to trigger the new marker.

On another note, I found that make_tsvector crashes if it receives a ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords as 1 even when it gets 0 words. I am not sure if this is the desired behavior.

-Sushant.
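To make the cover-selection idea above concrete, here is a minimal standalone sketch in C. It is not code from the patch and does not use the tsearch structures: the types Hit and Cover, the functions find_covers, shortest_cover and stretch_cover, and the MAX_TERMS limit are all hypothetical names introduced for illustration. It assumes the query-word occurrences are already available as a list sorted by position, which is the kind of information an indexed position vector would provide.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TERMS 32            /* hypothetical limit on query terms */

/* One occurrence of a query term in the document (hypothetical type). */
typedef struct
{
    int pos;                    /* word position in the document */
    int term;                   /* query term index, 0 .. nterms-1 */
} Hit;

/* A span of word positions that contains every query term. */
typedef struct
{
    int start;
    int end;
} Cover;

/*
 * Collect covers from hits sorted by position, visiting only query-word
 * occurrences. No reported cover contains another one. Returns the
 * number of covers written to out (at most max_out).
 */
static size_t
find_covers(const Hit *hits, size_t nhits, int nterms,
            Cover *out, size_t max_out)
{
    int     seen[MAX_TERMS] = {0};
    int     distinct = 0;
    size_t  left = 0;
    size_t  ncovers = 0;
    size_t  right;

    assert(nterms > 0 && nterms <= MAX_TERMS);

    for (right = 0; right < nhits; right++)
    {
        if (seen[hits[right].term]++ == 0)
            distinct++;
        if (distinct < nterms)
            continue;

        /* drop leading hits whose term occurs again inside the window */
        while (seen[hits[left].term] > 1)
        {
            seen[hits[left].term]--;
            left++;
        }

        if (ncovers < max_out)
        {
            out[ncovers].start = hits[left].pos;
            out[ncovers].end = hits[right].pos;
            ncovers++;
        }

        /* give up the leftmost hit so the next cover starts further right */
        seen[hits[left].term]--;
        distinct--;
        left++;
    }
    return ncovers;
}

/* Index of the first shortest cover, or -1 if there are none. */
static int
shortest_cover(const Cover *covers, size_t n)
{
    int     best = -1;
    size_t  i;

    for (i = 0; i < n; i++)
    {
        if (best < 0 ||
            covers[i].end - covers[i].start <
            covers[best].end - covers[best].start)
            best = (int) i;
    }
    return best;
}

/* Widen a cover towards max_words, clamped to positions 0 .. doclen-1. */
static Cover
stretch_cover(Cover c, int max_words, int doclen)
{
    int     extra = max_words - (c.end - c.start + 1);

    if (extra > 0)
    {
        c.start -= extra / 2;
        c.end += extra - extra / 2;
    }
    if (c.start < 0)
        c.start = 0;
    if (c.end > doclen - 1)
        c.end = doclen - 1;
    return c;
}
```

A caller would fill an array of Hit from the matched query words, call find_covers, pick candidates with shortest_cover, and pass each chosen cover through stretch_cover with MaxWords as the target size. Excluding covers that overlap already chosen ones, and honouring MinWords and ShortWord per fragment, are left out of this sketch.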
On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
> > I have attached a new patch with respect to the current cvs head. This
> > produces headline in a document for a given query. Basically it
> > identifies fragments of text that contain the query and displays them.
> New variant is much better, but...
>
> > HeadlineParsedText contains an array of actual words but not
> > information about the norms. We need an indexed position vector for each
> > norm so that we can quickly evaluate a number of possible fragments.
> > Something that tsvector provides.
>
> Why do you need to store norms? The single purpose of norms is identifying words
> from query - but it's already done by hlfinditem. It sets
> HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
> Look, headline function is rather expensive and your patch adds a lot of extra
> work - at least in memory usage. And if user calls with NumFragments=0 the that
> work is unneeded.
>
> > This approach does not change any other interface and fits nicely with
> > the overall framework.
> Yeah, it's a really big step forward. Thank you. You are very close to
> committing except: Did you find a hlCover() function which produce a cover from
> original HeadlineParsedText representation? Is any reason to do not use it?
>
> >
> > The norms are converted into tsvector and a number of covers are
> > generated. The best covers are then chosen to be in the headline. The
> > covers are separated using a hardcoded coversep. Let me know if you want
> > to expose this as an option.
> >
> >
> > Covers that overlap with already chosen covers are excluded.
> >
> > Some options like ShortWord and MinWords are not taken care of right
> > now. MaxWords are used as maxcoversize. Let me know if you would like to
> > see other options for fragment generation as well.
> ShortWord, MinWords and MaxWords should store their meaning, but for each
> fragment, not for the whole headline.
>
> >
> > Let me know any more changes you would like to see.
>
> if (num_fragments == 0)
>     /* call the default headline generator */
>     mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
> else
>     mark_hl_fragments(prs, query, highlight, num_fragments, max_words);
>
>
> Suppose, num_fragments < 2?
>