Re: Latin vs non-Latin words in text search parsing

From	Gregory Stark
Subject	Re: Latin vs non-Latin words in text search parsing
Date	22 October 2007, 07:32:16
Msg-id	871wbn1luo.fsf@oxford.xeocode.com
In reply to	Re: Latin vs non-Latin words in text search parsing ("Heikki Linnakangas" <heikki@enterprisedb.com>)
Responses	Re: Latin vs non-Latin words in text search parsing
List	pgsql-hackers

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword        Only ASCII letters
>>> nlword        Entirely letters per iswalpha(), but not lword
>>> word        Entirely alphanumeric per iswalnum(), but not nlword
>>>         (hence, includes at least one digit)
>> ...
>> I am not sure if there are any Western European languages where words can
>> only be formed with non-ASCII chars.
>
> There is at least in Swedish: "ö" (island) and "å" (river). They're both a
> bit special because they're just one letter each.

For what it's worth I did the same search last night and found three French
words including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries such as Italian and Irish also have one-letter words like this.
The only other language with multi-letter words of this kind is Faroese, with
"íð" and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.
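
As a rough illustration of the three classes Tom describes (not the parser's
actual code), here is a minimal Python sketch. It assumes Python's
Unicode-aware str.isalpha() and str.isalnum() as stand-ins for iswalpha() and
iswalnum(), whose results depend on the C locale, so the two may disagree on
some characters. The function name classify() and the "other" fallback are
hypothetical, added only for the example.

```python
def classify(token: str) -> str:
    """Assign a token to one of the three proposed classes.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha()/iswalnum().
    """
    if not token:
        raise ValueError("empty token")
    # lword: only ASCII letters
    if token.isascii() and token.isalpha():
        return "lword"
    # nlword: entirely letters, but at least one is non-ASCII
    if token.isalpha():
        return "nlword"
    # word: entirely alphanumeric, so it must contain at least one digit
    if token.isalnum():
        return "word"
    # Hypothetical fallback for anything else (punctuation, hyphens, ...)
    return "other"


if __name__ == "__main__":
    for tok in ["island", "ö", "çà", "íð", "abc123", "foo-bar"]:
        print(tok, classify(tok))
```

Under this sketch "ö", "çà" and "íð" all fall into nlword, while "island" is
an lword and "abc123" is a word.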

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure they use whitespace as simply as Latin-script languages do.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
