Hi everyone,
I was wondering whether the HTML parsing in full-text search should split tokens that are not separated by spaces in the original text but are separated only by an inline element. Here is an example:
SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>');
Results: "'ce':7 'hello':1 'n':5 'neighbor':2"
It makes sense for "Hello" and "neighbor" to be separated, because <p> is a block element. But "nice" should remain a single word, since there is no visual separation when it is rendered (<strong> and <em> are inline elements).
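
To make the expectation concrete, here is a rough Python sketch of the behavior I have in mind. It is not how PostgreSQL's parser works, just an illustration: the INLINE_TAGS set is my own assumed (and incomplete) list, and html_to_text is a hypothetical helper name. Inline tags are dropped without adding whitespace, while any other tag is treated as a word boundary.

```python
from html.parser import HTMLParser

# Assumed, incomplete list of inline elements (illustration only).
# Every tag not listed here is treated as block-level.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u", "small", "code"}


class _TextExtractor(HTMLParser):
    """Collects text, inserting a space only at non-inline tag boundaries."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _on_tag(self, tag):
        # Block-level tags break words; inline tags are dropped silently.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._on_tag(tag)

    def handle_endtag(self, tag):
        self._on_tag(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: strip tags, then collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>"))
```

With text preprocessed like this, "nice" reaches to_tsvector as one word, while "Hello" and "neighbor" stay separate.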
Sorry if this has been asked before, but I couldn't find it anywhere.
Thanks in advance,
Marcelo.