Re: snowball ASCII stemmer configuration

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: snowball ASCII stemmer configuration
Дата
Msg-id 1333705.1592322037@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: snowball ASCII stemmer configuration  (Mark Dilger <mark.dilger@enterprisedb.com>)
Список pgsql-hackers
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I am a bit surprised to see that you are right about this, because non-latin languages often have
transliteration/romanizationschemes for writing the language in the Latin alphabet, developed before computers had wide
spreadadoption of non-ASCII character sets, and still in use today for text messaging.  I expected to find stemming
rulesfor transliterated words, but can't find any indication of that, neither in the postgres sources, nor in the
snowballsources I pulled from their repo.  Is there some architectural separation of stemming from transliteration such
thatwe'd never need to worry about it?  If snowball ever published stemmers for transliterated text, we might have to
revisitthis issue, but for now your proposed change sounds fine to me. 

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if say you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.

            regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Mark Dilger
Дата:
Сообщение: Re: snowball ASCII stemmer configuration
Следующее
От: Fujii Masao
Дата:
Сообщение: Re: Review for GetWALAvailability()