wparser misbehavior(?) for corner cases with hyphenated words
From | Tom Lane |
---|---|
Subject | wparser misbehavior(?) for corner cases with hyphenated words |
Date | |
Msg-id | 6269.1193184058@sss.pgh.pa.us |
List | pgsql-hackers |
This does not seem right:

regression=# select alias,description,token from ts_debug('foo-8.3beta');
      alias      |             description             |  token
-----------------+-------------------------------------+---------
 numhword        | Hyphenated word, letters and digits | foo-8.3
 hword_asciipart | Hyphenated word part, all ASCII     | foo
 blank           | Space symbols                       | -
 float           | Decimal notation                    | 8.3
 hword_asciipart | Hyphenated word part, all ASCII     | beta
(5 rows)

(Code from just before my last commit behaves the same, modulo names of
token types, so I didn't break it just now.)

Surely, if "beta" is an hword part here, it should have been reported as
part of the overall hword. However, this is all pretty inconsistent,
because if "8.3" had been in the first chunk of text then we'd not have
considered it part of an hword at all:

regression=# select alias,description,token from ts_debug('8.3beta-foo');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 float           | Decimal notation                | 8.3
 asciihword      | Hyphenated word, all ASCII      | beta-foo
 hword_asciipart | Hyphenated word part, all ASCII | beta
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | foo
(5 rows)

regression=# select alias,description,token from ts_debug('beta8.3-foo');
 alias |    description    |    token
-------+-------------------+-------------
 file  | File or path name | beta8.3-foo
(1 row)

regression=# select alias,description,token from ts_debug('foo-beta8.3-foo');
      alias      |               description                |   token
-----------------+------------------------------------------+-----------
 numhword        | Hyphenated word, letters and digits      | foo-beta8
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta8
 blank           | Space symbols                            | .
 uint            | Unsigned integer                         | 3
 blank           | Space symbols                            | -
 asciiword       | Word, all ASCII                          | foo
(8 rows)

I'm of the opinion that in no circumstance should "." be considered part
of an hword: the definition of word should not be allowed to stretch
beyond letters and digits. So I think the second and fourth examples I
showed above are correct. The third (where it concludes it's a filename)
is maybe a bit odd, but in any case it's not an hword so I won't complain.
I think the first example ought to parse as

    asciiword    foo
    blank        -
    float        8.3
    asciiword    beta

(Or maybe the '-' should fold into the float? Don't care much...)

This is all a little bit tricky, since this behavior seems reasonable:

regression=# select alias,description,token from ts_debug('foo-83beta');
      alias      |               description                |   token
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | foo-83beta
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
(4 rows)

regression=# select alias,description,token from ts_debug('83beta-foo');
      alias      |               description                |   token
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | 83beta-foo
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | foo
(4 rows)

Basically I'm arguing that a string should be considered valid as a
second or subsequent component of an hword if and only if it would be
considered valid as the first component.

Comments?

regards, tom lane
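
For illustration only, the symmetry rule argued for above can be sketched in a few lines of Python. This is not the wparser code (which is a C state machine) and it does not tokenize anything; it only tests whether a whole candidate string would qualify as an hword when every hyphen-separated component must pass the same letters-and-digits test, regardless of position. The names `is_word_component` and `is_hword` are hypothetical, and treating "word" as "one or more Unicode letters or digits" is an assumption taken from the statement that a word should not stretch beyond letters and digits.

```python
import re

# Assumption: a valid word component is one or more letters or digits.
# [^\W_] matches any Unicode alphanumeric character (it excludes "_" and ".").
_WORD_RE = re.compile(r"[^\W_]+")


def is_word_component(chunk: str) -> bool:
    """Return True if chunk consists only of letters and digits."""
    return _WORD_RE.fullmatch(chunk) is not None


def is_hword(candidate: str) -> bool:
    """Return True if candidate is a hyphenated word under the proposed rule.

    The same test is applied to the first component and to every later
    component, so a string is acceptable in second position if and only if
    it would be acceptable in first position.
    """
    parts = candidate.split("-")
    return len(parts) >= 2 and all(is_word_component(p) for p in parts)


if __name__ == "__main__":
    for text in ("foo-83beta", "83beta-foo", "foo-8.3beta", "8.3beta-foo"):
        print(f"{text!r}: {is_hword(text)}")
```

Under these assumptions 'foo-83beta' and '83beta-foo' both qualify, while 'foo-8.3beta' and '8.3beta-foo' do not qualify as a whole, because "8.3beta" fails the component test in either position. How the parser then splits the rejected strings into float, word and blank tokens is outside what this sketch covers.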