Discussion: ts_headline and query with hyphen
Hi,

I have a question about ts_headline: when the query includes a word like 'on-line', only the 'line' part is highlighted, even though the whole phrase is indexed too. Some details below.

PostgreSQL 9.1.6

select token, dictionary, lexemes from ts_debug('play on-line') where alias <> 'blank';

  token  |  dictionary  | lexemes
---------+--------------+----------
 play    | english_stem | {play}
 on-line | english_stem | {on-lin}
 on      | english_stem | {}
 line    | english_stem | {line}

select to_tsquery('play & on-line');

         to_tsquery
----------------------------
 'play' & 'on-lin' & 'line'

select ts_headline('play on-line', to_tsquery('play & on-line'));

        ts_headline
----------------------------
 <b>play</b> on-<b>line</b>

This is the same result as

select ts_headline('play on-line', to_tsquery('play & line'));

        ts_headline
----------------------------
 <b>play</b> on-<b>line</b>

Is that the intended behaviour? I guess the problem here is that 'on' is not a lexeme, but then what about 'on-lin'?

In another example, I thought that a hyphenated match would have some kind of preference:

select token, dictionary, lexemes from ts_debug('custom-built query') where alias <> 'blank';

    token     |  dictionary  |    lexemes
--------------+--------------+----------------
 custom-built | english_stem | {custom-built}
 custom       | english_stem | {custom}
 built        | english_stem | {built}
 query        | english_stem | {queri}

select to_tsquery('query & custom-built');

                  to_tsquery
-----------------------------------------------
 'queri' & 'custom-built' & 'custom' & 'built'

select ts_headline('custom-built query', to_tsquery('query & custom-built'));

               ts_headline
-----------------------------------------
 <b>custom</b>-<b>built</b> <b>query</b>

This works better, but both parts of 'custom-built' are still highlighted separately. Or maybe ts_headline only understands or operates on single words, not hyphenated ones?

thanks
daniel
daniel <dochtorek@gmail.com> writes:
> I have a question about ts_headline, when the query includes word like
> 'on-line' - only the 'line' part is highlighted, even though the whole
> phrase is indexed too, some details below.

Part of the reason is that "on" is a stop word (at least in the default english dictionary). That's why you get

> select to_tsquery('play & on-line');
>          to_tsquery
> ----------------------------
>  'play' & 'on-lin' & 'line'

and not "'play' & 'on-lin' & 'on' & 'line'". If you did get the latter then you'd get a headline result with both parts highlighted, similar to your "custom-built" case.

> But maybe ts_headline understands or operates on
> single, not hyphenated words only?

Dunno. It would seem reasonable to highlight the whole compound in these cases, but I have no idea how hard that is.

Another thing that seems a bit odd here is that we seem to be stemming the compound word as a whole, but not the individual parts. Not sure how sane that combination of choices is ...

regards, tom lane
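The stop-word explanation above can be checked directly. The following is only a sketch, assuming the default english configuration routes these word tokens to the english_stem dictionary, as the ts_debug output earlier in the thread indicates. ts_lexize runs a single token through a single dictionary, and a stop word comes back as an empty array.

```sql
-- ts_lexize(dictionary, token) returns the lexemes one dictionary produces for one token.
-- A stop word is recognized by the dictionary but discarded, so the result is an empty array.
select ts_lexize('english_stem', 'on');    -- stop word: corresponds to the {} row in ts_debug
select ts_lexize('english_stem', 'line');  -- ordinary word: stemmed normally

-- The same effect from the query side: the stop word drops out of the tsquery,
-- which is why 'on' never reaches ts_headline as something to highlight.
select to_tsquery('english', 'on & line');
```

Naming the configuration explicitly in to_tsquery is a choice made here so the result does not depend on default_text_search_config; the original queries in the thread rely on the default.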
On 12/05/2012 04:49 AM, Tom Lane wrote:
> daniel <dochtorek@gmail.com> writes:
>> I have a question about ts_headline, when the query includes word like
>> 'on-line' - only the 'line' part is highlighted, even though the whole
>> phrase is indexed too, some details below.
>
> Part of the reason is that "on" is a stop word (at least in the default
> english dictionary). That's why you get
>
>> select to_tsquery('play & on-line');
>>          to_tsquery
>> ----------------------------
>>  'play' & 'on-lin' & 'line'
>
> and not "'play' & 'on-lin' & 'on' & 'line'". If you did get the latter
> then you'd get a headline result with both parts highlighted, similar to
> your "custom-built" case.
>

I understand the 'on' part, but still, 'on-lin' is passed to ts_headline, so I thought that match would be preferred over 'line' and highlighted as a whole.

Additionally, with a specific value of MaxWords I could see a dangling "line" at the start of a headline ("on-" has been cut off), which is kinda troubling, because it's not even an English document. It doesn't seem to happen with queries like 'custom-built': I can't see it being split either at the beginning of a headline or at the end.

Just to be clear: the headline with the cut-off "on-" is otherwise OK (the matched text is somewhere in the middle, though with only 'line' highlighted). It's just that the word 'on-line' is used multiple times in the document and one occurrence happened to appear at the beginning of the headline. The cutting was not affected by the ShortWord setting, so I guess it's a stop-word thing again. If that's the case, then IMHO ts_headline should treat a hyphenated word as one word when creating the headline and not cut it off like that. But maybe it was intended to work like that.

>> But maybe ts_headline understands or operates on
>> single, not hyphenated words only?
>
> Dunno. It would seem reasonable to highlight the whole compound in
> these cases, but I have no idea how hard that is.
>

Right, although the latter case is easy to fix outside postgres and still looks fine; I've included it just as an example. The former causes a few problems in specific cases, and I have to fix them manually now, word by word.

> Another thing that seems a bit odd here is that we seem to be stemming
> the compound word as a whole, but not the individual parts. Not sure
> how sane that combination of choices is ...
>

Good question, hope others will jump in.

thanks,
daniel
As a follow-up to my previous comment, here is an example of the cutting:

select ts_headline('game played on-line', to_tsquery('on-line & game'), 'MaxWords=3,MinWords=2,ShortWord=1');

      ts_headline
-----------------------
 <b>game</b> played on

That can't be right...

daniel
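For the case described earlier as easy to fix outside postgres (presumably the 'custom-built' one, where both halves of a compound are highlighted separately), the repair can also be done in SQL. This is a minimal sketch, not a fix for ts_headline itself. It assumes the default <b> and </b> markers (StartSel/StopSel left unchanged) and that a lone hyphen between two highlighted fragments always means one compound word. It does nothing for the 'on-line' case or for the cut-off "on" above, because there the first half is never highlighted.

```sql
-- Merge adjacent highlights that are separated only by a hyphen:
-- '<b>custom</b>-<b>built</b>' becomes '<b>custom-built</b>'.
select regexp_replace(
         ts_headline('custom-built query', to_tsquery('query & custom-built')),
         '</b>-<b>',   -- closing marker, hyphen, opening marker
         '-',          -- keep only the hyphen, joining the two highlighted spans
         'g');         -- replace every occurrence, not just the first
```

Given the headline shown earlier in the thread for this query, the result should be <b>custom-built</b> <b>query</b>. If StartSel or StopSel are customized, the pattern has to be adjusted to match.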