Re: Latin vs non-Latin words in text search parsing

From	Tatsuo Ishii
Subject	Re: Latin vs non-Latin words in text search parsing
Date	22 October 2007, 06:11:15
Msg-id	20071022.180947.55724535.t-ishii@sraoss.co.jp
In reply to	Re: Latin vs non-Latin words in text search parsing ("Heikki Linnakangas" <heikki@enterprisedb.com>)
List	pgsql-hackers

> Alvaro Herrera wrote:
> > Tom Lane wrote:
> >
> >> ISTM that perhaps a more generally useful definition would be
> >>
> >> lword        Only ASCII letters
> >> nlword        Entirely letters per iswalpha(), but not lword
> >> word        Entirely alphanumeric per iswalnum(), but not nlword
> >>         (hence, includes at least one digit)
> > ...
> > I am not sure if there are any western european languages where words can
> > only be formed with non-ascii chars.
>
> There is at least in Swedish: "ö" (island) and å (river). They're both a
> bit special because they're just one letter each.
>
> > lword        Entirely letters per iswalpha, with at least one ASCII
> > nlword        Entirely letters per iswalpha
> > word        Entirely alphanumeric per iswalnum, but not nlword
>
> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.
>
> I suppose that Tom's argument that it's useful to distinguish words made
> of purely ASCII characters in computer-oriented stuff is valid, though I
> can't immediately think of a use case. For things like parsing a
> programming language, that's not really enough, so you'd probably end up
> writing your own parser anyway. I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there's any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.
>
> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?
>
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.
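
For reference, the two categorizations quoted above can be sketched roughly as follows. This is only an illustration, not the parser's actual code: Python's str.isalpha() and str.isalnum() stand in for the locale-dependent C functions iswalpha() and iswalnum(), and the function names and the "other" label are made up for this example.

```python
def classify_tom(token: str) -> str:
    """Tom Lane's proposal: lword is ASCII letters only."""
    if token.isascii() and token.isalpha():
        return "lword"    # only ASCII letters
    if token.isalpha():
        return "nlword"   # entirely letters, but not lword
    if token.isalnum():
        return "word"     # alphanumeric, hence at least one digit
    return "other"        # hypothetical label for everything else


def classify_alvaro(token: str) -> str:
    """Alvaro Herrera's variant: lword needs at least one ASCII letter."""
    if token.isalpha():
        if any(ch.isascii() for ch in token):
            return "lword"   # entirely letters, at least one ASCII
        return "nlword"      # entirely letters, none of them ASCII
    if token.isalnum():
        return "word"        # alphanumeric, but not nlword
    return "other"           # hypothetical label for everything else


if __name__ == "__main__":
    for tok in ["hello", "naïve", "ö", "abc123"]:
        print(tok, classify_tom(tok), classify_alvaro(tok))
```

The two schemes differ only on mixed tokens such as "naïve" (nlword under the first, lword under the second). The Swedish one-letter words "ö" and "å" come out as nlword either way.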

The above is true, but that does not necessarily mean that Tsearch is not
used for Japanese at all. I work around the problem by running a
pre-processing step that separates Japanese sentences into words divided by
white space. I wish I could write a new parser that could do the
job for 8.4 or later...
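
To give a rough idea of what such a pre-processing step looks like, here is a toy sketch in Python. It is not the tool actually used: real Japanese word segmentation needs a morphological analyzer with a dictionary, whereas this stand-in only inserts a space wherever the script changes (kanji, hiragana, katakana, other alphanumerics). The function names and the Unicode ranges chosen are assumptions made for this example.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Crude script class of one character (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isspace():
        return "space"
    if ch.isalnum():
        return "alnum"
    return "other"


def presegment(text: str) -> str:
    """Join runs of same-script characters with single spaces.

    This is a script-boundary split, not true word segmentation.
    """
    pieces = []
    for script, run in groupby(text, key=script_of):
        if script == "space":
            continue  # existing whitespace is replaced by the single joins
        pieces.append("".join(run))
    return " ".join(pieces)


if __name__ == "__main__":
    print(presegment("東京タワーに行く"))
```

On "東京タワーに行く" this yields "東京 タワー に 行 く", which also shows the limit of the approach: the verb "行く" is wrongly split in two, so a dictionary-based segmenter is needed for anything serious. The output of such a step is what gets fed to the default whitespace-based parser.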

Please change the word definition very carefully.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
