Re: Latin vs non-Latin words in text search parsing

From	Alvaro Herrera
Subject	Re: Latin vs non-Latin words in text search parsing
Date	21 October 2007, 22:00:06
Msg-id	20071021215953.GA12111@alvh.no-ip.org
In reply to	Latin vs non-Latin words in text search parsing (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Latin vs non-Latin words in text search parsing Re: Latin vs non-Latin words in text search parsing
List	pgsql-hackers

Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
> 
> lword        Only ASCII letters
> nlword        Entirely letters per iswalpha(), but not lword
> word        Entirely alphanumeric per iswalnum(), but not nlword
>         (hence, includes at least one digit)
> 
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories.  I am not sure
I agree with this particular definition though.  I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 lignes)

I would think those would all fit in the "latin word" category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 lignes)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars.  At least in Spanish, accents tend
to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 ligne)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

lword        Entirely letters per iswalpha, with at least one ASCII
nlword        Entirely letters per iswalpha
word        Entirely alphanumeric per iswalnum, but not nlword
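
To make the difference between the two definitions concrete, here is a rough Python sketch. It is not the parser's code: it assumes Python's str.isalpha() and str.isalnum() are close enough stand-ins for iswalpha() and iswalnum() (they only approximate the C functions, which depend on the locale), and the names classify_quoted and classify_proposed are made up for illustration.

```python
# Illustrative sketch only: str.isalpha()/str.isalnum() stand in for
# iswalpha()/iswalnum(), and both function names are hypothetical.

def classify_quoted(token: str) -> str:
    """Categories as defined in the quoted message."""
    if token.isalpha():
        # lword: only ASCII letters; nlword: any other all-letter token
        return "lword" if token.isascii() else "nlword"
    if token.isalnum():
        # all alphanumeric but not all letters (e.g. contains a digit)
        return "word"
    return "other"


def classify_proposed(token: str) -> str:
    """Categories as proposed above."""
    if token.isalpha():
        # lword: all letters, at least one of them ASCII
        if any(ch.isascii() for ch in token):
            return "lword"
        # nlword: all letters, none of them ASCII
        return "nlword"
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    for tok in ["caracteres", "carácter", "añadido", "à", "abc123"]:
        print(f"{tok:12} quoted={classify_quoted(tok):7} "
              f"proposed={classify_proposed(tok)}")
```

Under this sketch, "carácter" and "añadido" are nlword by the quoted definition but lword by the proposed one, while "à" is nlword under both.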

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
