Re: snowball ASCII stemmer configuration

From	Mark Dilger
Subject	Re: snowball ASCII stemmer configuration
Date	June 16, 2020, 18:25:03
Msg-id	F260AAF4-FE8A-41C0-A3CA-3CEBE73C134A@enterprisedb.com
In reply to	Re: snowball ASCII stemmer configuration (Tom Lane <tgl@sss.pgh.pa.us>)
Replies	Re: snowball ASCII stemmer configuration
List	pgsql-hackers

> On Jun 16, 2020, at 7:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Moreover, AFAIK, the following other languages do not use Latin-based
>>> alphabets:
>
>>> arabic      arabic      \
>>> greek       greek       \
>>> nepali      nepali      \
>>> tamil       tamil       \
>
>> Hmm.  I think all of those entries are ones that got added by me while
>> absorbing post-2007 Snowball updates, and I confess that I did not think
>> about this point.  Maybe these should be changed.
>
> After further reflection, I think these are indeed mistakes and we should
> change them all.  The argument for the Russian/English case, AIUI, is
> "if we come across an all-ASCII word, it is most certainly not Russian,
> and the most likely Latin-based language is English".  Given the world
> as it is, I think the same argument works for all non-Latin-alphabet
> languages.  Obviously specific applications might have a different idea
> of the best fallback language, but that's why we let users make their
> own text search configurations.  For general-purpose use, falling back
> to English seems reasonable.  And we can be dead certain that applying
> a Greek stemmer to an ASCII word will do nothing useful, so the
> configuration choice shown above is unhelpful.

I am a bit surprised to see that you are right about this, because non-Latin languages often have
transliteration/romanization schemes for writing the language in the Latin alphabet, developed before computers had
widespread adoption of non-ASCII character sets, and still in use today for text messaging.  I expected to find stemming
rules for transliterated words, but can't find any indication of that, neither in the postgres sources, nor in the
snowball sources I pulled from their repo.  Is there some architectural separation of stemming from transliteration such
that we'd never need to worry about it?  If snowball ever published stemmers for transliterated text, we might have to
revisit this issue, but for now your proposed change sounds fine to me.
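
To make the fallback rule under discussion concrete, here is a rough sketch in Python. It is illustrative only and is not how PostgreSQL or Snowball implement anything: the ASCII_FALLBACK table and the pick_stemmer function are hypothetical names, and the table assumes the proposed change (English as the all-ASCII stemmer for the non-Latin-alphabet languages) has been applied.

```python
# Hypothetical sketch: these names are illustrative, not PostgreSQL or Snowball APIs.

# Assumed mapping after the proposed change: for each non-Latin-alphabet
# language, the stemmer to use when a word is entirely ASCII.
ASCII_FALLBACK = {
    "arabic": "english",
    "greek": "english",
    "nepali": "english",
    "tamil": "english",
    "russian": "english",
}


def pick_stemmer(word: str, language: str) -> str:
    """Return the name of the stemmer that would handle `word`."""
    if word.isascii():
        # All-ASCII word: almost certainly not written in the native script,
        # so use the fallback (or the language itself if it has none listed).
        return ASCII_FALLBACK.get(language, language)
    # Word contains non-ASCII characters: use the language's own stemmer.
    return language
```

Under this rule a romanized Greek word such as "vasi" is all-ASCII and so goes to the English stemmer, the same as "database". That is the transliteration case I am asking about.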

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
