Thread: [Fwd: Re: tsearch in core patch]

[Fwd: Re: tsearch in core patch]

From
teodor@sigaev.ru
Date:
>> How would this work for initdb with locale C?
>
> I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except for the
C locale, which is a well-known exception.
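
To make the proposed mapping concrete, here is a minimal sketch in Python. It is not code from the patch: `LOCALE_MAP` and `config_for_locale` are hypothetical names, and the only rule it assumes is the one stated above, namely that everything after the dot in a locale name can be ignored and that C has no dot.

```python
# Hypothetical table in the spirit of: english '{en_GB, en_US, C}'
LOCALE_MAP = {
    "english": ["en_GB", "en_US", "C"],
}


def config_for_locale(locale_name, locale_map=LOCALE_MAP):
    """Return the text search configuration name for a locale, or None."""
    # Drop the part after the dot, e.g. "en_US.UTF-8" -> "en_US".
    # "C" has no dot, so it passes through unchanged.
    base = locale_name.split(".", 1)[0]
    for config, locales in locale_map.items():
        if base in locales:
            return config
    return None  # no default configuration known for this locale
```

With this table, `config_for_locale("en_US.UTF-8")` and `config_for_locale("C")` both give `"english"`, while a locale that is not listed gives `None`.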





Re: [Fwd: Re: tsearch in core patch]

From
Tatsuo Ishii
Date:
> >> How would this work for initdb with locale C?
> >
> > I'm worrying about that too.
> 
> english '{en_GB, en_US, C}'
> 
> I suppose that a locale name always has a dot separator, except for the
> C locale, which is a well-known exception.

So would we have to have this?

japanese '{ja_JP, C}'

How would we know C -> japanese?

Also I'm wondering how we could handle texts including Japanese and
English. It's very common in Japan.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: [Fwd: Re: tsearch in core patch]

From
Euler Taveira de Oliveira
Date:
Tatsuo Ishii wrote:

> japanese '{ja_JP, C}'
> 
> How would we know C -> japanese?
> 
You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language' because the stemmer doesn't know
whether a specific word is in English or Japanese. So you have two options:
(a) disable stemming (b) leave the language set to 'japanese' and see if
it plays well.


--  Euler Taveira de Oliveira http://www.timbira.com/


Re: [Fwd: Re: tsearch in core patch]

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> 
> > japanese '{ja_JP, C}'
> > 
> > How would we know C -> japanese?
> > 
> You can't do that. You can't have different languages (not locales)
> mapping to the same 'tsearch language' because the stemmer doesn't know
> whether a specific word is in English or Japanese. So you have two options:
> (a) disable stemming (b) leave the language set to 'japanese' and see if
> it plays well.

Ok, probably we need to copy the English stemming rule to the one for
Japanese. I think the same thing (commonly used English with local
language) can be applied to Chinese and Korean.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: [Fwd: Re: tsearch in core patch]

From
Tom Lane
Date:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Ok, probably we need to copy the English stemming rule to the one for
> Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean?  What little I know about ideographic
languages suggests it wouldn't work well.  And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

> I think the same thing (commonly used English with local
> language) can be applied to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers.  I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.
        regards, tom lane


Re: [Fwd: Re: tsearch in core patch]

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > Ok, probably we need to copy the English stemming rule to the one for
> > Japanese.
> 
> Pardon my ignorance here, but is the concept of stemming even relevant
> to Japanese/Chinese/Korean?  What little I know about ideographic
> languages suggests it wouldn't work well.  And surely the specific rules
> in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the Japanese "non English" part.

What I meant was the "chunks of English text" in Japanese.

> > I think the same thing (commonly used English with local
> > language) can be applied to Chinese and Korean.
> 
> Well, it's not hard at all to find chunks of English text that have
> embedded bits of French, Spanish, or what-have-you, but that's not an
> argument for trying to intermix the stemmers.  I doubt that such simple
> bits of program could tell the language difference well enough to
> determine which stemming rules to apply.

For Japanese, it will be fairly simple: words in the 7-bit ASCII range must
be English (note that the most commonly used Japanese encodings, such as
EUC, do not allow mixing with ISO 8859).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
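
The ASCII-range rule described above can be sketched roughly as follows. This is only an illustration in Python, not tsearch code: `TOKEN_RE` and `classify_tokens` are hypothetical names, and the sketch assumes the text has already been decoded to a Unicode string rather than being handled as raw EUC bytes.

```python
import re

# A token is either a run of ASCII letters/digits or a run of non-ASCII
# characters. ASCII whitespace and punctuation act as separators.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\x00-\x7f]+")


def classify_tokens(text):
    """Split text into tokens and tag each one by the 7-bit ASCII rule."""
    result = []
    for token in TOKEN_RE.findall(text):
        # Pure ASCII tokens are assumed to be English; everything else
        # is assumed to be Japanese.
        lang = "english" if token.isascii() else "japanese"
        result.append((token, lang))
    return result
```

Under this rule only the tokens tagged `"english"` would be passed to the English stemmer; the `"japanese"` runs would still need to be broken into words by other means.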


Re: [Fwd: Re: tsearch in core patch]

From
"Mike Rylander"
Date:
On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, it's not hard at all to find chunks of English text that have
> embedded bits of French, Spanish, or what-have-you, but that's not an
> argument for trying to intermix the stemmers.  I doubt that such simple
> bits of program could tell the language difference well enough to
> determine which stemming rules to apply.
>

While I imagine that is probably true of many, if not most, my project
in particular would greatly benefit from the ability to mix stemmers.
I work with complex bibliographic data, which has language information
embedded within records.  This is not limited to the record level
either.  Individual fields within each bibliographic record can be in
different languages.

Especially in countries where making software multi-lingual (such as
Canada (en_CA/fr_CA)) is a requirement for use in public institutions,
the ability to choose a stemmer and stop-word list at will for any
particular record will actually provide the exact behavior needed.
The obvious generalization from Canada would be to support any mix of
languages supported by tsearch2.

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index.  So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Thanks for listening (and for all the great work on getting tsearch
into core! :) ...

-- 
Mike Rylander


Re: [Fwd: Re: tsearch in core patch]

From
Tom Lane
Date:
"Mike Rylander" <mrylander@gmail.com> writes:
> I can certainly understand the benefit of making the default
> configuration a simple locale to language map, but there are
> definitely uses for searching using different stemmers/stop-lists even
> within the same corpus/index.  So, as a datapoint for the discussion,
> I would ask that the option of multiple languages per DB locale not be
> removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
"default" configuration.
        regards, tom lane


Re: [Fwd: Re: tsearch in core patch]

From
"Mike Rylander"
Date:
On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Mike Rylander" <mrylander@gmail.com> writes:
> > I can certainly understand the benefit of making the default
> > configuration a simple locale to language map, but there are
> > definitely uses for searching using different stemmers/stop-lists even
> > within the same corpus/index.  So, as a datapoint for the discussion,
> > I would ask that the option of multiple languages per DB locale not be
> > removed if it can be at all avoided.
>
> Nobody is proposing that --- the issue here is just how we set up the
> "default" configuration.
>

Then I misunderstood.  Sorry for the noise, folks.

-- 
Mike Rylander


Re: [Fwd: Re: tsearch in core patch]

From
Josh Berkus
Date:
Ishii-san,

>>> Ok, probably we need to copy the English stemming rule to the one for
>>> Japanese.
>> Pardon my ignorance here, but is the concept of stemming even relevant
>> to Japanese/Chinese/Korean?  What little I know about ideographic
>> languages suggests it wouldn't work well.  And surely the specific rules
>> in the Snowball project's English stemmer wouldn't work.
> 
> Your understanding is correct. The English stemmer would not work for
> the Japanese "non English" part.

That reminds me, don't you guys have your own full text search for 
Japanese?  Planning on merging it with the core code anytime soon?

--Josh


Re: [Fwd: Re: tsearch in core patch]

From
Tatsuo Ishii
Date:
> Ishii-san,
> 
> >>> Ok, probably we need to copy the English stemming rule to the one for
> >>> Japanese.
> >> Pardon my ignorance here, but is the concept of stemming even relevant
> >> to Japanese/Chinese/Korean?  What little I know about ideographic
> >> languages suggests it wouldn't work well.  And surely the specific rules
> >> in the Snowball project's English stemmer wouldn't work.
> > 
> > Your understanding is correct. The English stemmer would not work for
> > the Japanese "non English" part.
> 
> That reminds me, don't you guys have your own full text search for 
> Japanese?  Planning on merging it with the core code anytime soon?

No. Actually Japanese (the non-English part) does not need stemming at
all. However, since Japanese is an agglutinative language, we have to
break a continuous Japanese string into space-separated "words". For
example, we need to break:

todayisfine

into:

today is fine

(of course the English words are only there for non-Japanese speakers'
understanding; the actual text is Japanese).

For this we need a good dictionary and software. Fortunately we have
several kinds of open source software for this purpose. I once
wrote a PostgreSQL C function invoking one of these packages to do
the work, and it works great with tsearch2.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
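
As a toy illustration of the word-breaking step described above, here is a greedy longest-match segmenter in Python. It is a deliberately simplified stand-in, not the software mentioned in the message: real segmentation tools use large dictionaries and more careful disambiguation. The function name `segment` and the three-word dictionary are assumptions made for the example.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest-match against a dictionary.

    The dictionary must be a non-empty collection of words.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit a single character
            # so the loop always makes progress.
            words.append(text[i])
            i += 1
    return words
```

Using the example from the message, `" ".join(segment("todayisfine", {"today", "is", "fine"}))` yields `today is fine`, which is the space-separated form that could then be handed to tsearch2.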