Thread: snowball ASCII stemmer configuration

snowball ASCII stemmer configuration

From

Peter Eisentraut

Date:

16 June 2020, 11:16:21

While I was updating the snowball code, I noticed something strange.  In 
src/backend/snowball/Makefile:

# first column is language name and also name of dictionary for not-all-ASCII
# words, second is name of dictionary for all-ASCII words
# Note order dependency: use of some other language as ASCII dictionary
# must come after creation of that language
LANGUAGES=  \
     arabic      arabic      \
     basque      basque      \
     catalan     catalan     \
etc.

There are two cases where these two columns are not the same:

     hindi       english     \
     russian     english     \

The second one is old; the first one I added using the second one as 
example.  But I wonder what the rationale for this is.  Maybe for hindi 
one could make some kind of cultural argument, but for russian this 
seems entirely arbitrary.  Perhaps using "simple" would be more sound here.

Moreover, AFAIK, the following other languages do not use Latin-based 
alphabets:

     arabic      arabic      \
     greek       greek       \
     nepali      nepali      \
     tamil       tamil       \

So I wonder by what rationale they use their own stemmer for the ASCII 
fallback, which is probably not going to produce anything significant.

What's the general idea here?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
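
The two-column layout described in the message above can be sketched in a few lines of Python. This is only an illustration: the table is the excerpt quoted in the message, not the full Makefile list, and the helper names `parse_languages` and `differing_rows` are made up for this sketch rather than taken from the PostgreSQL build.

```python
# Excerpt of the LANGUAGES table as quoted in the message (not the full list).
LANGUAGES = """
arabic      arabic
basque      basque
catalan     catalan
hindi       english
russian     english
"""


def parse_languages(table):
    """Return (language, ascii_dictionary) pairs, one per two-field row."""
    pairs = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def differing_rows(pairs):
    """Rows whose all-ASCII dictionary is not the language's own stemmer."""
    return [(lang, ascii_dict) for lang, ascii_dict in pairs if lang != ascii_dict]


pairs = parse_languages(LANGUAGES)
print(differing_rows(pairs))  # [('hindi', 'english'), ('russian', 'english')]
```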

Re: snowball ASCII stemmer configuration

From

Tom Lane

Date:

16 June 2020, 16:53:46

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> There are two cases where these two columns are not the same:

>      hindi       english     \
>      russian     english     \

> The second one is old; the first one I added using the second one as 
> example.  But I wonder what the rationale for this is.  Maybe for hindi 
> one could make some kind of cultural argument, but for russian this 
> seems entirely arbitrary.

Perhaps it is, but we have actual Russians who think it's a good idea.
I recall questioning that point some years ago, and Oleg replied that
they'd done that intentionally because (a) technical Russian uses a lot
of English words, and (b) it's easy to tell which is which thanks to
the disjoint letter sets.

Whether the same is true for Hindi, I have no idea.

> Moreover, AFAIK, the following other languages do not use Latin-based 
> alphabets:

>      arabic      arabic      \
>      greek       greek       \
>      nepali      nepali      \
>      tamil       tamil       \

Hmm.  I think all of those entries are ones that got added by me while
absorbing post-2007 Snowball updates, and I confess that I did not think
about this point.  Maybe these should be changed.

            regards, tom lane

Re: snowball ASCII stemmer configuration

From

Oleg Bartunov

Date:

16 June 2020, 17:32:19

On Tue, Jun 16, 2020 at 4:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> There are two cases where these two columns are not the same:
>
>>      hindi       english     \
>>      russian     english     \
>
>> The second one is old; the first one I added using the second one as
>> example.  But I wonder what the rationale for this is.  Maybe for hindi
>> one could make some kind of cultural argument, but for russian this
>> seems entirely arbitrary.
>
> Perhaps it is, but we have actual Russians who think it's a good idea.
> I recall questioning that point some years ago, and Oleg replied that
> they'd done that intentionally because (a) technical Russian uses a lot
> of English words, and (b) it's easy to tell which is which thanks to
> the disjoint letter sets.

Yes, you are right.

> Whether the same is true for Hindi, I have no idea.
>
>> Moreover, AFAIK, the following other languages do not use Latin-based
>> alphabets:
>
>>      arabic      arabic      \
>>      greek       greek       \
>>      nepali      nepali      \
>>      tamil       tamil       \
>
> Hmm.  I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point.  Maybe these should be changed.
>
>             regards, tom lane

Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: snowball ASCII stemmer configuration

From

Tom Lane

Date:

16 June 2020, 17:37:17

I wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based 
>> alphabets:

>> arabic      arabic      \
>> greek       greek       \
>> nepali      nepali      \
>> tamil       tamil       \

> Hmm.  I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point.  Maybe these should be changed.

After further reflection, I think these are indeed mistakes and we should
change them all.  The argument for the Russian/English case, AIUI, is
"if we come across an all-ASCII word, it is most certainly not Russian,
and the most likely Latin-based language is English".  Given the world
as it is, I think the same argument works for all non-Latin-alphabet
languages.  Obviously specific applications might have a different idea
of the best fallback language, but that's why we let users make their
own text search configurations.  For general-purpose use, falling back
to English seems reasonable.  And we can be dead certain that applying
a Greek stemmer to an ASCII word will do nothing useful, so the
configuration choice shown above is unhelpful.

            regards, tom lane
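
As a rough sketch of what the proposed change amounts to, the Python below rewrites the all-ASCII column to `english` for the non-Latin-alphabet languages named in this thread. The function names, the `NON_LATIN` set, and the shortened table are assumptions made for illustration; none of this is the actual patch. The `order_violations` check models the order dependency stated in the Makefile comment quoted earlier (a language used as an ASCII dictionary must already have been created).

```python
# Languages named in this thread as not using a Latin-based alphabet.
NON_LATIN = {"arabic", "greek", "hindi", "nepali", "russian", "tamil"}

# Shortened, illustrative table: (language, all-ASCII dictionary).
current = [
    ("arabic", "arabic"),
    ("basque", "basque"),
    ("english", "english"),
    ("greek", "greek"),
    ("hindi", "english"),
    ("nepali", "nepali"),
    ("russian", "english"),
    ("tamil", "tamil"),
]


def with_english_fallback(pairs, non_latin=NON_LATIN, fallback="english"):
    """Use `fallback` as the all-ASCII dictionary for non-Latin languages."""
    return [
        (lang, fallback if lang in non_latin else ascii_dict)
        for lang, ascii_dict in pairs
    ]


def order_violations(pairs):
    """Rows that reference another language's dictionary before it is created."""
    created = set()
    bad = []
    for lang, ascii_dict in pairs:
        if ascii_dict != lang and ascii_dict not in created:
            bad.append((lang, ascii_dict))
        created.add(lang)
    return bad


proposed = with_english_fallback(current)
for lang, ascii_dict in proposed:
    print(f"{lang:<12}{ascii_dict}")
print("order violations:", order_violations(proposed))
```

In this sketch the `arabic` row ends up referring to `english` before `english` has been created, so an alphabetically ordered list would presumably need `english` moved earlier, or the creation order handled some other way.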

Re: snowball ASCII stemmer configuration

From

Mark Dilger

Date:

16 June 2020, 18:25:03

> On Jun 16, 2020, at 7:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Moreover, AFAIK, the following other languages do not use Latin-based
>>> alphabets:
>
>>> arabic      arabic      \
>>> greek       greek       \
>>> nepali      nepali      \
>>> tamil       tamil       \
>
>> Hmm.  I think all of those entries are ones that got added by me while
>> absorbing post-2007 Snowball updates, and I confess that I did not think
>> about this point.  Maybe these should be changed.
>
> After further reflection, I think these are indeed mistakes and we should
> change them all.  The argument for the Russian/English case, AIUI, is
> "if we come across an all-ASCII word, it is most certainly not Russian,
> and the most likely Latin-based language is English".  Given the world
> as it is, I think the same argument works for all non-Latin-alphabet
> languages.  Obviously specific applications might have a different idea
> of the best fallback language, but that's why we let users make their
> own text search configurations.  For general-purpose use, falling back
> to English seems reasonable.  And we can be dead certain that applying
> a Greek stemmer to an ASCII word will do nothing useful, so the
> configuration choice shown above is unhelpful.

I am a bit surprised to see that you are right about this, because
non-latin languages often have transliteration/romanization schemes for
writing the language in the Latin alphabet, developed before computers
had widespread adoption of non-ASCII character sets, and still in use
today for text messaging.  I expected to find stemming rules for
transliterated words, but can't find any indication of that, neither in
the postgres sources, nor in the snowball sources I pulled from their
repo.  Is there some architectural separation of stemming from
transliteration such that we'd never need to worry about it?  If
snowball ever published stemmers for transliterated text, we might have
to revisit this issue, but for now your proposed change sounds fine to
me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snowball ASCII stemmer configuration

From

Tom Lane

Date:

16 June 2020, 18:40:37

Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I am a bit surprised to see that you are right about this, because
> non-latin languages often have transliteration/romanization schemes for
> writing the language in the Latin alphabet, developed before computers
> had widespread adoption of non-ASCII character sets, and still in use
> today for text messaging.  I expected to find stemming rules for
> transliterated words, but can't find any indication of that, neither in
> the postgres sources, nor in the snowball sources I pulled from their
> repo.  Is there some architectural separation of stemming from
> transliteration such that we'd never need to worry about it?  If
> snowball ever published stemmers for transliterated text, we might have
> to revisit this issue, but for now your proposed change sounds fine to
> me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if say you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.

            regards, tom lane
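
The routing rule discussed above, and one reading of why it would struggle with a Russian/French pairing, can be illustrated with a simplified Python sketch. The `route` function and the word lists are hypothetical; the real decision is made by the text search parser's token types, not by code like this.

```python
def route(word, language_dict, ascii_dict):
    """Pick a dictionary the way the all-ASCII / not-all-ASCII split does."""
    return ascii_dict if word.isascii() else language_dict


# Russian configuration with English as the all-ASCII dictionary:
# Cyrillic and Latin letters are disjoint, so the split is clean.
for word in ["база", "database"]:
    print(word, "->", route(word, "russian", "english"))

# Hypothetical Russian configuration with French as the all-ASCII dictionary:
# accented French words are not all-ASCII, so they go to the Russian stemmer.
for word in ["bonjour", "café", "база"]:
    print(word, "->", route(word, "russian", "french"))
```

Here `café` is sent to the Russian dictionary simply because it contains a non-ASCII letter, which is the kind of misrouting an ASCII-only test cannot avoid.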

Re: snowball ASCII stemmer configuration

From

Peter Eisentraut

Date:

19 June 2020, 12:46:53

On 2020-06-16 16:37, Tom Lane wrote:
> After further reflection, I think these are indeed mistakes and we should
> change them all.  The argument for the Russian/English case, AIUI, is
> "if we come across an all-ASCII word, it is most certainly not Russian,
> and the most likely Latin-based language is English".  Given the world
> as it is, I think the same argument works for all non-Latin-alphabet
> languages.  Obviously specific applications might have a different idea
> of the best fallback language, but that's why we let users make their
> own text search configurations.  For general-purpose use, falling back
> to English seems reasonable.  And we can be dead certain that applying
> a Greek stemmer to an ASCII word will do nothing useful, so the
> configuration choice shown above is unhelpful.

Do we *have* to have an ASCII stemmer that corresponds to an actual 
language?  Couldn't we use the simple stemmer or no stemmer at all?

In my experience, ASCII text in, say, Russian or Greek will typically be 
acronyms or brand names or the like, and there doesn't seem to be a 
great need to stem that kind of thing.  Just doing nothing seems at 
least as good.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: snowball ASCII stemmer configuration

From

Tom Lane

Date:

19 June 2020, 16:44:20

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Do we *have* to have an ASCII stemmer that corresponds to an actual 
> language?  Couldn't we use the simple stemmer or no stemmer at all?
> In my experience, ASCII text in, say, Russian or Greek will typically be 
> acronyms or brand names or the like, and there doesn't seem to be a 
> great need to stem that kind of thing.  Just doing nothing seems at 
> least as good.

Well, I have no horse in this race.  But the reason it's like this for
Russian is that Oleg, Teodor, and crew set it up that way ages ago.
I'd tend to defer to their opinion about what's the most usable
configuration for Russian.  You could certainly argue that the situation
is different for $other-language ... but without some hard evidence for
that position, making these cases all behave similarly seems like a
reasonable approach.

            regards, tom lane
