Re: Latin vs non-Latin words in text search parsing

From Alvaro Herrera
Subject Re: Latin vs non-Latin words in text search parsing
Msg-id 20071021215953.GA12111@alvh.no-ip.org
In response to Latin vs non-Latin words in text search parsing  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Latin vs non-Latin words in text search parsing  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Latin vs non-Latin words in text search parsing  ("Heikki Linnakangas" <heikki@enterprisedb.com>)
List pgsql-hackers
Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
> 
> lword     Only ASCII letters
> nlword    Entirely letters per iswalpha(), but not lword
> word      Entirely alphanumeric per iswalnum(), but not nlword
>           (hence, includes at least one digit)
> 
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories.  I am not sure
I agree with this particular definition though.  I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 lignes)

I would think those would all fit in the "latin word" category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 lignes)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars.  At least in Spanish, accents tend
to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 ligne)

I don't think this is much of a problem, since this particular word is
(most likely) a stopword.

So, how about

lword     Entirely letters per iswalpha, with at least one ASCII letter
nlword    Entirely letters per iswalpha, but not lword
word      Entirely alphanumeric per iswalnum, but not nlword
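
To make the difference between the two proposals concrete, here is a
minimal sketch in Python.  It is not the parser's code: str.isalpha()
and str.isalnum() are used as stand-ins for iswalpha() and iswalnum(),
the function names are hypothetical, and the "other" fallback is an
addition for tokens that neither rule set covers.  It also assumes the
accented letters are precomposed (NFC) characters.

```python
def is_ascii_letter(ch: str) -> bool:
    # True only for a-z / A-Z; str.isascii() needs Python 3.7+.
    return ch.isascii() and ch.isalpha()


def classify_original(token: str) -> str:
    """Rules quoted from Tom Lane: lword is ASCII letters only."""
    if token and all(is_ascii_letter(c) for c in token):
        return "lword"
    if token.isalpha():      # all letters, but at least one non-ASCII
        return "nlword"
    if token.isalnum():      # alphanumeric, hence at least one digit
        return "word"
    return "other"           # assumption: not covered by either proposal


def classify_alternative(token: str) -> str:
    """Rules proposed in this message: lword needs one ASCII letter."""
    if token.isalpha():
        if any(is_ascii_letter(c) for c in token):
            return "lword"
        return "nlword"      # all letters, none of them ASCII
    if token.isalnum():
        return "word"
    return "other"           # assumption: not covered by either proposal


if __name__ == "__main__":
    for token in ["caracteres", "carácter", "añadido", "à", "abc123"]:
        print(token, classify_original(token), classify_alternative(token))
```

Under this sketch, 'carácter' and 'añadido' are nlword by the quoted
rules and lword by the rules proposed here, while 'à' is nlword under
both, which matches the case described above as probably harmless.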

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

