Re: old bug in full text parser

From: Oleg Bartunov
Subject: Re: old bug in full text parser
Msg-id: CAF4Au4xrkE5yHbNDBg+0Cn0VLKm9c+SD13No0yUix483_F2bvw@mail.gmail.com
In reply to: old bug in full text parser (Oleg Bartunov <obartunov@gmail.com>)
List: pgsql-hackers


On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
It looks like there is a very old bug in the full text parser (somebody pointed it out to me), which appeared after tsearch2 was moved into core. The problem is in how the full text parser processes hyphenated words. Our original idea was to report the hyphenated word itself as well as its parts, and to ignore the hyphen. That is how tsearch2 worked.

This behaviour changed after tsearch2 was moved into core:
1. The hyphen is now reported by the parser, which is useless.
2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed differently from ones made of plain text words such as 'four-dot': the hyphenated word itself is not reported.

I think we should consider this a bug and produce a fix for all supported versions.
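
To make the intended behaviour concrete, here is a minimal Python sketch of the rule described above: report the whole hyphenated word, then each of its parts, and never the hyphen. It is only a model for illustration; the function name expected_tokens is hypothetical and this is not how the PostgreSQL parser is implemented.

```python
# Hypothetical helper, not part of PostgreSQL: it only models the
# intended reporting rule for a single hyphenated word.
def expected_tokens(text):
    """Return the whole word first, then each part; the hyphen is dropped."""
    parts = [p for p in text.split("-") if p]
    if len(parts) < 2:
        # Not a hyphenated word: report it unchanged.
        return [text]
    return [text] + parts


if __name__ == "__main__":
    for word in ("dot-four", "dot-4", "4-dot"):
        print(word, "->", expected_tokens(word))
```

Under this rule 'dot-four', 'dot-4' and '4-dot' are all treated the same way, regardless of whether a part is a word or a number.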

After investigating, we found this commit:

commit 73e6f9d3b61995525785b2f4490b465fe860196b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Oct 27 19:03:45 2007 +0000

    Change text search parsing rules for hyphenated words so that digit strings
    containing decimal points aren't considered part of a hyphenated word.
    Sync the hyphenated-word lookahead states with the subsequent part-by-part
    reparsing states so that we don't get different answers about how much text
    is part of the hyphenated word.  Per my gripe of a few days ago.


8.2.23

select tok_type, description, token from ts_debug('dot-four');
  tok_type   |          description          |  token
-------------+-------------------------------+----------
 lhword      | Latin hyphenated word         | dot-four
 lpart_hword | Latin part of hyphenated word | dot
 lpart_hword | Latin part of hyphenated word | four
(3 rows)

select tok_type, description, token from ts_debug('dot-4');
  tok_type   |          description          | token
-------------+-------------------------------+-------
 hword       | Hyphenated word               | dot-4
 lpart_hword | Latin part of hyphenated word | dot
 uint        | Unsigned integer              | 4
(3 rows)

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)

8.3.23

select alias, description, token from ts_debug('dot-four');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 asciihword      | Hyphenated word, all ASCII      | dot-four
 hword_asciipart | Hyphenated word part, all ASCII | dot
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | four
(4 rows)

select alias, description, token from ts_debug('dot-4');
   alias   |   description   | token
-----------+-----------------+-------
 asciiword | Word, all ASCII | dot
 int       | Signed integer  | -4
(2 rows)

select alias, description, token from ts_debug('4-dot');
   alias   |   description    | token
-----------+------------------+-------
 uint      | Unsigned integer | 4
 blank     | Space symbols    | -
 asciiword | Word, all ASCII  | dot
(3 rows)



Oh, one more bug, which existed even in tsearch2: '4-dot' is not reported as a hyphenated word at all.

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)



 

Regards,
Oleg
