wparser misbehavior(?) for corner cases with hyphenated words

From: Tom Lane
Subject: wparser misbehavior(?) for corner cases with hyphenated words
Msg-id: 6269.1193184058@sss.pgh.pa.us
List: pgsql-hackers

This does not seem right:

regression=# select alias,description,token from ts_debug('foo-8.3beta');
      alias      |             description             |  token  
-----------------+-------------------------------------+---------
 numhword        | Hyphenated word, letters and digits | foo-8.3
 hword_asciipart | Hyphenated word part, all ASCII     | foo
 blank           | Space symbols                       | -
 float           | Decimal notation                    | 8.3
 hword_asciipart | Hyphenated word part, all ASCII     | beta
(5 rows)

(Code from just before my last commit behaves the same, modulo names of
token types, so I didn't break it just now.)

Surely, if "beta" is an hword part here, it should have been reported as
part of the overall hword.  However, this is all pretty inconsistent,
because if "8.3" had been in the first chunk of text then we'd not have
considered it part of an hword at all:

regression=# select alias,description,token from ts_debug('8.3beta-foo');
      alias      |           description           |  token   
-----------------+---------------------------------+----------
 float           | Decimal notation                | 8.3
 asciihword      | Hyphenated word, all ASCII      | beta-foo
 hword_asciipart | Hyphenated word part, all ASCII | beta
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | foo
(5 rows)

regression=# select alias,description,token from ts_debug('beta8.3-foo');
 alias |    description    |    token    
-------+-------------------+-------------
 file  | File or path name | beta8.3-foo
(1 row)

regression=# select alias,description,token from ts_debug('foo-beta8.3-foo');
      alias      |               description                |   token   
-----------------+------------------------------------------+-----------
 numhword        | Hyphenated word, letters and digits      | foo-beta8
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta8
 blank           | Space symbols                            | .
 uint            | Unsigned integer                         | 3
 blank           | Space symbols                            | -
 asciiword       | Word, all ASCII                          | foo
(8 rows)

I'm of the opinion that in no circumstance should "." be considered part
of an hword: the definition of word should not be allowed to stretch
beyond letters and digits.  So I think the second and fourth examples
I showed above are correct.  The third (where it concludes it's a
filename) is maybe a bit odd, but in any case it's not an hword so I won't
complain.  I think the first example ought to parse as

asciiword    foo
blank        -
float        8.3
asciiword    beta

(Or maybe the '-' should fold into the float?  Don't care much...)

This is all a little bit tricky, since this behavior seems reasonable:

regression=# select alias,description,token from ts_debug('foo-83beta');
      alias      |               description                |   token    
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | foo-83beta
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
(4 rows)

regression=# select alias,description,token from ts_debug('83beta-foo');
      alias      |               description                |   token    
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | 83beta-foo
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | foo
(4 rows)

Basically I'm arguing that a string should be considered valid as a
second or subsequent component of an hword if and only if it would be
considered valid as the first component.
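
To make the proposed rule concrete, here is a rough sketch in Python (not the parser's actual C code). It assumes a "word" component is just a run of ASCII letters and digits, and it checks a whole string only; it does not model the float or file token types, non-ASCII letters, or longest-match tokenization. The names is_valid_component and is_hword are made up for illustration.

```python
import re

# Assumption: a word component is one or more ASCII letters/digits.
# The real parser also accepts non-ASCII letters; this sketch does not.
WORD = re.compile(r"[A-Za-z0-9]+")


def is_valid_component(chunk: str) -> bool:
    """True if chunk would qualify as the first component of an hword."""
    return WORD.fullmatch(chunk) is not None


def is_hword(text: str) -> bool:
    """True if text is two or more hyphen-joined components, each of which
    passes the same test regardless of its position."""
    parts = text.split("-")
    return len(parts) >= 2 and all(is_valid_component(p) for p in parts)


if __name__ == "__main__":
    for s in ("foo-83beta", "83beta-foo", "foo-8.3beta"):
        print(s, is_hword(s))
```

Because every component goes through the same test, "8.3beta" is rejected as a second component for the same reason it would be rejected as a first one: it contains a ".".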

Comments?
        regards, tom lane

