Re: Latin vs non-Latin words in text search parsing

From Tatsuo Ishii
Subject Re: Latin vs non-Latin words in text search parsing
Msg-id 20071022.180947.55724535.t-ishii@sraoss.co.jp
In response to Re: Latin vs non-Latin words in text search parsing  ("Heikki Linnakangas" <heikki@enterprisedb.com>)
List pgsql-hackers
> Alvaro Herrera wrote:
> > Tom Lane wrote:
> >
> >> ISTM that perhaps a more generally useful definition would be
> >>
> >> lword        Only ASCII letters
> >> nlword        Entirely letters per iswalpha(), but not lword
> >> word        Entirely alphanumeric per iswalnum(), but not nlword
> >>         (hence, includes at least one digit)
> > ...
> > I am not sure if there are any western european languages where words can
> > only be formed with non-ascii chars.
>
> There is at least in Swedish: "ö" (island) and å (river). They're both a
> bit special because they're just one letter each.
>
> > lword        Entirely letters per iswalpha, with at least one ASCII
> > nlword        Entirely letters per iswalpha
> > word        Entirely alphanumeric per iswalnum, but not nlword
>
> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.
>
> I suppose that Tom's argument that it's useful to distinguish words made
> of purely ASCII characters in computer-oriented stuff is valid, though I
> can't immediately think of a use case. For things like parsing a
> programming language, that's not really enough, so you'd probably end up
> writing your own parser anyway. I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there are any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.
>
> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?
>
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

The above is true, but that does not necessarily mean that Tsearch is
not used for Japanese at all. I work around the problem with a
pre-processing step that splits Japanese sentences into words
separated by white space. I wish I could write a new parser that
could do the job for 8.4 or later...
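
As a rough illustration of what such a pre-processing step does (not
the actual tool used, which would be a real morphological analyzer),
here is a minimal Python sketch. The vocabulary VOCAB and the function
names segment() and preprocess() are hypothetical, and greedy
longest-match is only a stand-in technique:

```python
# Hypothetical toy vocabulary; a real setup would rely on a
# morphological analyzer with a full dictionary instead.
VOCAB = {"東京", "東京都", "に", "住む", "私", "は"}


def segment(text, vocab=VOCAB):
    """Greedy longest-match segmentation.

    At each position, take the longest vocabulary entry that matches;
    characters not covered by the vocabulary become one-character tokens.
    """
    max_len = max(map(len, vocab))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


def preprocess(text):
    """Return the text with words separated by single spaces."""
    return " ".join(segment(text))
```

The output of preprocess() is what would then be handed to the default
parser, which can split it on white space like any other text.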

Please change the word definition very carefully.
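
To make the stakes concrete, the three classes Tom proposed in the
quoted text can be sketched roughly as below. This is only an
approximation in Python: classify() is a hypothetical name, and
str.isalpha()/str.isalnum() stand in for iswalpha()/iswalnum(), whose
results depend on the C locale and may differ in edge cases.

```python
def classify(token):
    """Classify a token per the proposed lword/nlword/word definition.

    Returns None for tokens that are empty or contain anything other
    than letters and digits.
    """
    if not token or not token.isalnum():
        return None
    if token.isascii() and token.isalpha():
        return "lword"   # only ASCII letters
    if token.isalpha():
        return "nlword"  # entirely letters, at least one non-ASCII
    return "word"        # alphanumeric with at least one non-letter


# Pre-segmented Japanese words fall into the nlword class.
for tok in ["island", "ö", "東京都", "beta2"]:
    print(tok, classify(tok))
```

Under this definition every Japanese word produced by a pre-processing
step is an nlword, so any change to that class changes how such text
is handled.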
--
Tatsuo Ishii
SRA OSS, Inc. Japan

