inconsistency in full-text search tokenization

From Valentin Gatien-Baron
Subject inconsistency in full-text search tokenization
Msg-id CA+0DEqhdmhie8MMmodE3qNogu0mbrTA+i-vTdjznEZ5fX2CbbQ@mail.gmail.com
List pgsql-bugs
Hello,

I observe the following:

select to_tsvector('simple', 'bla bla ./aaa bla bla'),
       phraseto_tsquery('simple', './aaa'),
       to_tsvector('simple', 'bla bla ./aaa bla bla') @@ phraseto_tsquery('simple', './aaa') as matches;
      to_tsvector       | phraseto_tsquery | matches
------------------------+------------------+---------
 '/aaa':3 'bla':1,2,4,5 | './aaa'          | f
(1 row)

I expected that any space-separated piece of the input text could be
selected, turned into a query, and would then match the original text.
That is not the case here: as the output shows, './aaa' is tokenized as
'./aaa' at the start of the text but as '/aaa' after a space.
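
One way to look at the difference at the token level might be to run
ts_debug on both inputs. This is only a sketch: it assumes the same
'simple' configuration as above, and the shortened second input
('bla ./aaa') is chosen purely for illustration.

```sql
-- Tokens produced when './aaa' is at the very start of the text.
select alias, token
from ts_debug('simple', './aaa');

-- Tokens produced when './aaa' follows a space.
-- 'bla ./aaa' is an illustrative input, not taken from the report above.
select alias, token
from ts_debug('simple', 'bla ./aaa');
```

Comparing the alias column of the two results should show which part of
the input is treated as blank and which part ends up in a file token.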

I looked for more such cases; my limited testing found the problem
only with '.' and '~' at the start of the text:

select
  quote_literal(text1) as qtext1,
  quote_literal(text2) as qtext2,
  ts_vector1,
  ts_vector2,
  array(select alias || ':' || quote_literal(token) from ts_debug('simple', text1)) as ts_debug1,
  array(select alias || ':' || quote_literal(token) from ts_debug('simple', text2)) as ts_debug2,
  ts_vector1 @@ phraseto_tsquery(text2) as phraseto_match
from
  unnest(array['', ')']) as zz0(prefix),
  (select chr(a) as char1 from generate_series(1,128) as s1(a) where (a not between 49 and 57) and (a not between 65 and 90) and (a not between 98 and 122)) as zz1,
  (select chr(a) as char2 from generate_series(1,128) as s1(a) where (a not between 49 and 57) and (a not between 65 and 90) and (a not between 98 and 122)) as zz2,
  (select chr(a) as char3 from generate_series(1,128) as s1(a) where (a not between 49 and 57) and (a not between 65 and 90) and (a not between 98 and 122)) as zz3,
  lateral (select prefix ||          char1 || char2 || char3 as text1,
                  prefix || ' '   || char1 || char2 || char3 as text2,
                  prefix ||          char1 || char2 || ' '   as text11,
                  prefix || ' '   || char1 || char2 || ' '   as text22) zz4,
  lateral (select to_tsvector('simple', text1)  as ts_vector1,
                  to_tsvector('simple', text2)  as ts_vector2,
                  to_tsvector('simple', text11) as ts_vector11,
                  to_tsvector('simple', text22) as ts_vector22) as zz8
where
  ts_vector1 != ts_vector2
  and (ts_vector11 = ts_vector22 or char3 = ' ')
;
 qtext1 | qtext2 | ts_vector1 | ts_vector2 |        ts_debug1        |                ts_debug2                 | phraseto_match
--------+--------+------------+------------+-------------------------+------------------------------------------+----------------
 '.. '  | ' .. ' | '..':1     |            | {file:'..',"blank:' '"} | {"blank:' .. '"}                         | f
 '~0 '  | ' ~0 ' | '~0':1     | '0':1      | {file:'~0',"blank:' '"} | {"blank:' ~'",uint:'0',"blank:' '"}      | f
 '~_ '  | ' ~_ ' | '~_':1     |            | {file:'~_',"blank:' '"} | {"blank:' ~_ '"}                         | f
 '~a '  | ' ~a ' | '~a':1     | 'a':1      | {file:'~a',"blank:' '"} | {"blank:' ~'",asciiword:'a',"blank:' '"} | f
 './0'  | ' ./0' | './0':1    | '/0':1     | {file:'./0'}            | {"blank:' .'",file:'/0'}                 | f
 '~/0'  | ' ~/0' | '~/0':1    | '/0':1     | {file:'~/0'}            | {"blank:' ~'",file:'/0'}                 | f
 './_'  | ' ./_' | './_':1    | '/_':1     | {file:'./_'}            | {"blank:' .'",file:'/_'}                 | f
 '~/_'  | ' ~/_' | '~/_':1    | '/_':1     | {file:'~/_'}            | {"blank:' ~'",file:'/_'}                 | f
 './a'  | ' ./a' | './a':1    | '/a':1     | {file:'./a'}            | {"blank:' .'",file:'/a'}                 | f
 '~/a'  | ' ~/a' | '~/a':1    | '/a':1     | {file:'~/a'}            | {"blank:' ~'",file:'/a'}                 | f
(10 rows)




select version();
                                                 version                                                
---------------------------------------------------------------------------------------------------------
 PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)
