Обсуждение: BUG #15277: ts_headline strips things that look like HTML tags and itcannot be disabled

Поиск
Список
Период
Сортировка

BUG #15277: ts_headline strips things that look like HTML tags and itcannot be disabled

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15277
Logged by:          Dan Book
Email address:      grinnz@gmail.com
PostgreSQL version: 9.6.9
Operating system:   CentOS 7
Description:

This post has a good overview of the issue with a reproduction case:
https://stackoverflow.com/questions/40263956/why-is-postgresql-stripping-html-entities-in-ts-headline

I have text that is not HTML and contains things that look like HTML tags.
The headlines are HTML escaped when output. It is very odd to have this text
missing from the resulting headlines and no way to control the behavior.


Re: BUG #15277: ts_headline strips things that look like HTML tagsand it cannot be disabled

От
Arthur Zakirov
Дата:
Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" token. By default they are
ignored. You need to modify existing configuration or create new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
   add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags. Because it is
hardcoded [1].

I think it isn't good to change the behaviour for existing versions of
PostgreSQL. But there is a workaround of course if it is appropriate for
someone. It is possible to create your own text search parser extension.
Example [2]. And change

#define HLIDREPLACE(x)    ( (x)==TAG_T )

to

#define HLIDREPLACE(x)    ( false )


1 - https://github.com/postgres/postgres/blob/master/src/backend/tsearch/wparser_def.c#L1923
2 - https://github.com/postgrespro/pg_tsparser

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


On Thu, Jul 12, 2018 at 5:22 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" token. By default they are
ignored. You need to modify existing configuration or create new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
   add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags. Because it is
hardcoded [1].

I think it isn't good to change the behaviour for existing versions of
PostgreSQL. But there is a workaround of course if it is appropriate for
someone. It is possible to create your own text search parser extension.
Example [2]. And change

#define HLIDREPLACE(x)  ( (x)==TAG_T )

to

#define HLIDREPLACE(x)  ( false )

Thanks for the response. It's good to know this is possible but defining a custom parser is not ideal.

-Dan