Thread: BUG #10589: hungarian.stop file spelling error

BUG #10589: hungarian.stop file spelling error

From:
zsoros@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      10589
Logged by:          Sörös Zoltán
Email address:      zsoros@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Linux
Description:

Hi!
The 'hungarian.stop' file (for tsearch, located in
src/backend/snowball/stopwords in the source tarball) contains the õ
('otilde' in HTML) character instead of the correct 'ő' character. (There
are 7 occurrences in this file.)

Our database uses latin2 encoding, where we use the correct 'ő' characters.
Here's an excerpt from today's log:

< 2014-06-10 08:49:24.416 CEST >ERROR:  character with byte sequence 0xc3
0xb5 in encoding "UTF8" has no equivalent in encoding "LATIN2"
< 2014-06-10 08:49:24.416 CEST >CONTEXT:  line 58 of configuration file
"/usr/pgsql-9.3/share/tsearch_data/hungarian.stop"

After I replaced the tilde-capped letters in the hungarian.stop file,
the problem vanished and tsearch works fine.
I'm sorry, I can't give you the utf8 byte sequence for 'ő', but I can send
the corrected hungarian.stop file if needed.

Please fix this file in the next release.

Thanks in advance,
Zoltán Sörös

Re: BUG #10589: hungarian.stop file spelling error

From:
Kevin Grittner
Date:
"zsoros@gmail.com" <zsoros@gmail.com> wrote:=0A=0A> I'm sorry, I can't give=
 you the utf8 byte sequence for '=C5=91'=0A=0AA quick copy/paste from your =
email into psql (using UTF-8 encoding)=0Ashows:=0A=0Atest=3D# select to_hex=
(ascii('=C5=91'));=0A=C2=A0to_hex=0A--------=0A=C2=A0151=0A(1 row)=0A=0Ates=
t=3D# select E'\u0151', convert_to(E'\u0151', 'UTF8');=0A=C2=A0?column? | c=
onvert_to =0A----------+------------=0A=C2=A0=C5=91=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0 | \xc591=0A(1 row)=0A=0A--=0AKevin Grittner=0AEDB: http:=
//www.enterprisedb.com=0AThe Enterprise PostgreSQL Company

Re: BUG #10589: hungarian.stop file spelling error

From:
Tom Lane
Date:
Kevin Grittner <kgrittn@ymail.com> writes:
> "zsoros@gmail.com" <zsoros@gmail.com> wrote:
>> I'm sorry, I can't give you the utf8 byte sequence for 'ő'

> A quick copy/paste from your email into psql (using UTF-8 encoding)
> shows:
> [ it's U+0151 ]

I believe that the way we got this file in the first place was to
scrape it from
http://snowball.tartarus.org/algorithms/hungarian/stop.txt
since it's not in the Snowball distribution.  It looks to me like the
webserver delivers that page in LATIN1 (ISO-8859-1) encoding, which would
go far towards explaining the encoding problem, since U+0151 isn't
representable in LATIN1.  So now I'm wondering what other similar mistakes
there may be in the non-LATIN1 languages.
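
A quick check in a UTF-8 psql session confirms that U+0151 has no
LATIN1 equivalent:

test=# select convert_to(E'\u0151', 'LATIN1');
ERROR:  character with byte sequence 0xc5 0x91 in encoding "UTF8" has no equivalent in encoding "LATIN1"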

I have an inquiry in to the upstream Snowball list asking if there's a
safer way to obtain copies of their stopword files.

            regards, tom lane

Re: BUG #10589: hungarian.stop file spelling error

From:
Tom Lane
Date:
I wrote:
> [ we seem to have gotten a misencoded version of hungarian.stop ]

Actually, it looks like things are even worse than that: the Hungarian
stemmer code seems to be confused about this too.  In the first place,
we've got a LATIN1 version of that stemmer, which I would imagine is
entirely useless; and in the second place, the UTF8 version has no
reference to any non-LATIN1 characters.

Again, I'm suspecting this problem goes further than Hungarian,
because the set of stem_ISO_8859_1_foo.c files in
src/backend/snowball/libstemmer/ covers a lot more languages than
I think LATIN1 is meant to cope with.  I'm not sure how much of this
is broken in the original Snowball code and how much is our error
while importing the code.

            regards, tom lane

Re: BUG #10589: hungarian.stop file spelling error

From:
Tom Lane
Date:
I wrote:
>> [ we seem to have gotten a misencoded version of hungarian.stop ]

> Actually, it looks like things are even worse than that: the Hungarian
> stemmer code seems to be confused about this too.  In the first place,
> we've got a LATIN1 version of that stemmer, which I would imagine is
> entirely useless; and in the second place, the UTF8 version has no
> reference to any non-LATIN1 characters.

> Again, I'm suspecting this problem goes further than Hungarian,
> because the set of stem_ISO_8859_1_foo.c files in
> src/backend/snowball/libstemmer/ covers a lot more languages than
> I think LATIN1 is meant to cope with.  I'm not sure how much of this
> is broken in the original Snowball code and how much is our error
> while importing the code.

After further analysis, it appears that:

1. The cause of the immediately complained-of problem is that we took
the stopword file we got from the Snowball website to be in LATIN1,
whereas it evidently was meant to be in LATIN2.  The problematic
characters were code 0xF5 in the file, which we translated to U+00F5,
but the correct translation is U+0151.  (There is another discrepancy
between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
none of those in the stopword file.)
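
For illustration, decoding that byte under each encoding side by side
shows the discrepancy (a psql sketch, assuming a UTF-8 database):

test=# select convert_from('\xf5'::bytea, 'LATIN1') as latin1,
              convert_from('\xf5'::bytea, 'LATIN2') as latin2;
 latin1 | latin2
--------+--------
 õ      | ő
(1 row)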

2. The Snowball people were just as confused as we were about the
appropriate encoding to use for Hungarian: their code claims that the
Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
character codes used in it:

/* special characters (in ISO Latin I) */

stringdef a'  hex 'E1'  //a-acute
stringdef e'  hex 'E9'  //e-acute
stringdef i'  hex 'ED'  //i-acute
stringdef o'  hex 'F3'  //o-acute
stringdef o"  hex 'F6'  //o-umlaut
stringdef oq  hex 'F5'  //o-double acute
stringdef u'  hex 'FA'  //u-acute
stringdef u"  hex 'FC'  //u-umlaut
stringdef uq  hex 'FB'  //u-double acute

Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
and u-double-acute don't appear in LATIN1 at all, and the codes shown here
are really for LATIN2.
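
The whole stringdef table can be checked the same way; decoding each
hex code under both encodings (again a psql sketch) shows that exactly
0xF5 and 0xFB come out differently:

test=# select hex,
              convert_from(decode(hex, 'hex'), 'LATIN1') as latin1,
              convert_from(decode(hex, 'hex'), 'LATIN2') as latin2
       from unnest(array['e1','e9','ed','f3','f6','f5','fa','fc','fb']) as hex;
 hex | latin1 | latin2
-----+--------+--------
 e1  | á      | á
 e9  | é      | é
 ed  | í      | í
 f3  | ó      | ó
 f6  | ö      | ö
 f5  | õ      | ő
 fa  | ú      | ú
 fc  | ü      | ü
 fb  | û      | ű
(9 rows)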

I've reported this issue upstream and there are fixes pending.

3. While I was concerned that there might be similar bugs in the other
Snowball stemmers, it appears after a bit of research that LATIN1 is
commonly used as an encoding for all the other languages the Snowball
code claims it can be used for, even though in a few cases there are
seldom-used characters that LATIN1 can't represent.  So there's not a
clear reason to think there are any other undetected problems (and
I would certainly not be the man to find them if they exist).


I've gone ahead and committed the encoding fix for hungarian.stop in all
active branches.  I'm going to wait for Snowball upstream to accept the
proposed patches before I think about incorporating the code changes.

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text.  If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release.  Perhaps we should only make the code changes
in HEAD and 9.4?

            regards, tom lane

Re: BUG #10589: hungarian.stop file spelling error

From:
Gavin Flower
Date:
On 11/06/14 15:09, Tom Lane wrote:
> [ analysis of the hungarian.stop LATIN1/LATIN2 encoding mix-up,
>   quoted in full from the previous message ]

Not saying there is any problem, but you might like to check how the EUR
currency symbol is handled (it is in LATIN2, but not in LATIN1):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € EURO SIGN
(HTML: &#8364; &euro;)


Cheers,
Gavin

Re: BUG #10589: hungarian.stop file spelling error

From:
Alvaro Herrera
Date:
Gavin Flower wrote:

> Not saying there is any problem, but you might like to check how the
> EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):
>
> https://en.wikipedia.org/wiki/Euro_sign
> U+20AC € EURO SIGN
> (HTML: &#8364; &euro;)

Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented
IIUC.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10589: hungarian.stop file spelling error

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Gavin Flower wrote:
>> Not saying there is any problem, but you might like to check how the
>> EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):

> Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented
> IIUC.

Yeah, I doubt there's much to be learned from the euro-sign case.
The Snowball stemmers certainly don't care about euro --- they
only work with alphabetic characters.

Actually, an interesting point is that we could probably use one of the
single-byte-encoding LATIN1 stemmers when the database encoding is LATIN9,
and thereby save a translation to UTF8 and back, since the stemmer logic
isn't going to care about euro signs.  Likewise for LATIN2 vs LATIN10.
Not sure it's worth the trouble though.
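
As a sanity check on that idea, the LATIN1/LATIN9 difference is easy to
enumerate in psql (a sketch, assuming a UTF-8 database); only the eight
code points that ISO-8859-15 revised show up:

test=# select to_hex(b) as byte,
              convert_from(decode(to_hex(b), 'hex'), 'LATIN1') as latin1,
              convert_from(decode(to_hex(b), 'hex'), 'LATIN9') as latin9
       from generate_series(160, 255) as b
       where convert_from(decode(to_hex(b), 'hex'), 'LATIN1')
             <> convert_from(decode(to_hex(b), 'hex'), 'LATIN9');
 byte | latin1 | latin9
------+--------+--------
 a4   | ¤      | €
 a6   | ¦      | Š
 a8   | ¨      | š
 b4   | ´      | Ž
 b8   | ¸      | ž
 bc   | ¼      | Œ
 bd   | ½      | œ
 be   | ¾      | Ÿ
(8 rows)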

            regards, tom lane

Re: BUG #10589: hungarian.stop file spelling error

From:
Bruce Momjian
Date:
On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:
> I'm not real sure whether we should consider back-patching those changes.
> Right now, the Hungarian stemmer is applying rules meant for
> o-double-acute to o-tilde, which probably means that those stemming rules
> don't fire at all on actual Hungarian text.  If we fix that then the
> stemmer will behave differently, which might not be all that desirable to
> change in a minor release.  Perhaps we should only make the code changes
> in HEAD and 9.4?

Does this affect any tsvectors stored in earlier major releases that
would read differently after this patch?  Does it cause a pg_upgrade
problem?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: BUG #10589: hungarian.stop file spelling error

From:
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:
>> I'm not real sure whether we should consider back-patching those changes.
>> Right now, the Hungarian stemmer is applying rules meant for
>> o-double-acute to o-tilde, which probably means that those stemming rules
>> don't fire at all on actual Hungarian text.  If we fix that then the
>> stemmer will behave differently, which might not be all that desirable to
>> change in a minor release.  Perhaps we should only make the code changes
>> in HEAD and 9.4?

> Does this affect any tsvectors stored in earlier major releases that
> would read differently after this patch?  Does it cause a pg_upgrade
> problem?

My guess is the field usage of the Hungarian stemmer is near zero,
or somebody would've complained about this before.  Hence, I'm not
thinking we should expend any huge effort to work around problems.

In any case, Oleg and Teodor have opined in the past that small changes
in dictionary behavior don't cause major practical problems; the worst
case is that some words aren't found by searches because the current
dictionary normalizes them differently than what's in the index.
You can get around that if you have to by entering the tsquery manually
rather than going through to_tsquery.
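
For example, with the English stemmer, to_tsquery normalizes through
the dictionary while a direct cast to tsquery does not, so the cast
lets you match whatever lexeme is actually stored in the index:

test=# select to_tsquery('english', 'stemming'), 'stemming'::tsquery;
 to_tsquery |  tsquery
------------+------------
 'stem'     | 'stemming'
(1 row)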

            regards, tom lane