Discussion: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

From

PG Bug reporting form

Date:

July 12, 2018, 13:59:40

The following bug has been logged on the website:

Bug reference:      15277
Logged by:          Dan Book
Email address:      grinnz@gmail.com
PostgreSQL version: 9.6.9
Operating system:   CentOS 7
Description:

This post has a good overview of the issue with a reproduction case:
https://stackoverflow.com/questions/40263956/why-is-postgresql-stripping-html-entities-in-ts-headline

I have text that is not HTML and contains things that look like HTML tags.
The headlines are HTML escaped when output. It is very odd to have this text
missing from the resulting headlines and no way to control the behavior.

Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

From

Arthur Zakirov

Date:

July 12, 2018, 15:22:06

Hello,

On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> I have text that is not HTML and contains things that look like HTML tags.
> The headlines are HTML escaped when output. It is very odd to have this text
> missing from the resulting headlines and no way to control the behavior.

<b> and </b> are recognized as "tag" tokens. By default they are
ignored. You need to modify the existing configuration or create a new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
   add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

But even in this case ts_headline will skip tags, because that behaviour
is hardcoded [1].

I don't think it would be good to change the behaviour in existing versions
of PostgreSQL. There is a workaround, though, if it suits someone: you can
create your own text search parser extension (see [2] for an example) and
change

#define HLIDREPLACE(x)    ( (x)==TAG_T )

to

#define HLIDREPLACE(x)    ( false )
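
To illustrate what the macro change does, here is a toy model in C. It is
not PostgreSQL's actual headline code: the ToyToken struct and the
toy_headline helper are hypothetical, the token type numbers are
illustrative, and the sketch assumes that a token marked for replacement is
emitted as a single space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative token type ids (assumed values, used only by this sketch). */
enum { ASCIIWORD = 1, SPACE = 12, TAG_T = 13 };

/* The macro quoted in the thread: tag tokens are marked for replacement.
 * Guarded so it can be overridden at compile time. */
#ifndef HLIDREPLACE
#define HLIDREPLACE(x) ((x) == TAG_T)
#endif

/* Hypothetical token record: a type id plus the original token text. */
typedef struct {
    int type;
    const char *text;
} ToyToken;

/* Hypothetical helper: concatenates tokens into out, writing one space in
 * place of every token for which HLIDREPLACE() is true. Stops early if the
 * next piece would not fit. Returns the length of the resulting string. */
static size_t toy_headline(const ToyToken *toks, size_t ntoks,
                           char *out, size_t outsz)
{
    size_t len = 0;

    assert(out != NULL && outsz > 0);
    for (size_t i = 0; i < ntoks; i++) {
        const char *piece = HLIDREPLACE(toks[i].type) ? " " : toks[i].text;
        size_t n = strlen(piece);

        if (len + n >= outsz)
            break;              /* leave room for the terminating NUL */
        memcpy(out + len, piece, n);
        len += n;
    }
    out[len] = '\0';
    return len;
}
```

Fed the five tokens from the ts_debug output above (query, blank, <b>,
test, </b>), the sketch with the default macro produces "query  test ",
with each tag replaced by a space. Building it with
-D'HLIDREPLACE(x)=(false)' keeps the tags and yields "query <b>test</b>".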


1 - https://github.com/postgres/postgres/blob/master/src/backend/tsearch/wparser_def.c#L1923
2 - https://github.com/postgrespro/pg_tsparser

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

From

Dan Book

Date:

July 12, 2018, 21:33:52

On Thu, Jul 12, 2018 at 5:22 AM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

> Hello,
>
> On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> > I have text that is not HTML and contains things that look like HTML tags.
> > The headlines are HTML escaped when output. It is very odd to have this text
> > missing from the resulting headlines and no way to control the behavior.
>
> <b> and </b> are recognized as "tag" tokens. By default they are
> ignored. You need to modify the existing configuration or create a new one:
>
> =# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
> =# alter text search configuration english_tag
>    add mapping for tag with simple;
>
> Then tags aren't skipped:
>
> =# select * from ts_debug('english_tag', 'query <b>test</b>');
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
>  blank     | Space symbols   |       | {}             | (null)       | (null)
>  tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
>  asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
>  tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}
>
> But even in this case ts_headline will skip tags, because that behaviour
> is hardcoded [1].
>
> I don't think it would be good to change the behaviour in existing versions
> of PostgreSQL. There is a workaround, though, if it suits someone: you can
> create your own text search parser extension (see [2] for an example) and
> change
>
> #define HLIDREPLACE(x)    ( (x)==TAG_T )
>
> to
>
> #define HLIDREPLACE(x)    ( false )

Thanks for the response. It's good to know this is possible but defining a custom parser is not ideal.

-Dan
