Discussion: BUG #10589: hungarian.stop file spelling error
The following bug has been logged on the website:

Bug reference:      10589
Logged by:          Sörös Zoltán
Email address:      zsoros@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Linux
Description:

Hi!

The 'hungarian.stop' file (for tsearch, located in src/backend/snowball/stopwords in the source tarball) contains the õ ('otilde' in HTML) character instead of the correct 'ő' character. (There are 7 occurrences in this file.) Our database uses latin2 encoding, where we use the correct 'ő' characters.

Here's an excerpt from today's log:

< 2014-06-10 08:49:24.416 CEST >ERROR:  character with byte sequence 0xc3 0xb5 in encoding "UTF8" has no equivalent in encoding "LATIN2"
< 2014-06-10 08:49:24.416 CEST >CONTEXT:  line 58 of configuration file "/usr/pgsql-9.3/share/tsearch_data/hungarian.stop"

After I replaced the tilde-capped letters in the hungarian.stop file, the problem vanished, and tsearch works fine. I'm sorry, I can't give you the utf8 byte sequence for 'ő', but I can send the corrected hungarian.stop file if needed.

Please fix this file in the next release.

Thanks in advance,
Zoltán Sörös
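[Editor's note: the conversion failure in the log can be reproduced outside PostgreSQL. A minimal Python sketch (Python used purely for illustration) shows that U+00F5 (õ), whose UTF-8 encoding is exactly the reported 0xc3 0xb5, has no LATIN2 equivalent, while the intended U+0151 (ő) converts fine:]

```python
# U+00F5 (o-tilde) is the character that ended up in hungarian.stop;
# its UTF-8 encoding matches the byte sequence from the error message.
assert '\u00f5'.encode('utf-8') == b'\xc3\xb5'

# Converting it to LATIN2 fails, just as the server reported...
try:
    '\u00f5'.encode('iso8859_2')          # iso8859_2 == LATIN2
    raise AssertionError('expected a UnicodeEncodeError')
except UnicodeEncodeError:
    pass

# ...while the correct Hungarian o-double-acute (U+0151) converts fine,
# landing on byte 0xF5 in LATIN2.
assert '\u0151'.encode('iso8859_2') == b'\xf5'
```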
"zsoros@gmail.com" <zsoros@gmail.com> wrote:

> I'm sorry, I can't give you the utf8 byte sequence for 'ő'

A quick copy/paste from your email into psql (using UTF-8 encoding)
shows:

test=# select to_hex(ascii('ő'));
 to_hex
--------
 151
(1 row)

test=# select E'\u0151', convert_to(E'\u0151', 'UTF8');
 ?column? | convert_to
----------+------------
 ő        | \xc591
(1 row)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
> "zsoros@gmail.com" <zsoros@gmail.com> wrote:
>> I'm sorry, I can't give you the utf8 byte sequence for 'ő'

> A quick copy/paste from your email into psql (using UTF-8 encoding)
> shows:
> [ it's U+0151 ]

I believe that the way we got this file in the first place was to scrape it from http://snowball.tartarus.org/algorithms/hungarian/stop.txt since it's not in the Snowball distribution. It looks to me like the webserver delivers that page in LATIN1 (ISO-8859-1) encoding, which would go far towards explaining the encoding problem, since U+0151 isn't representable in LATIN1.

So now I'm wondering what other similar mistakes there may be in the non-LATIN1 languages. I have an inquiry in to the upstream Snowball list asking if there's a safer way to obtain copies of their stopword files.

regards, tom lane
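[Editor's note: the scraping theory hinges on byte 0xF5 meaning different characters under the two encodings. A quick illustrative check, using Python's standard codecs:]

```python
# The same byte 0xF5, decoded under each interpretation:
misread = bytes([0xF5]).decode('iso8859_1')   # treated as LATIN1
correct = bytes([0xF5]).decode('iso8859_2')   # actually LATIN2

assert misread == '\u00f5'   # õ, o-tilde: what ended up in the file
assert correct == '\u0151'   # ő, o-double-acute: what was meant
```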
I wrote:
> [ we seem to have gotten a misencoded version of hungarian.stop ]

Actually, it looks like things are even worse than that: the Hungarian stemmer code seems to be confused about this too. In the first place, we've got a LATIN1 version of that stemmer, which I would imagine is entirely useless; and in the second place, the UTF8 version has no reference to any non-LATIN1 characters.

Again, I'm suspecting this problem goes further than Hungarian, because the set of stem_ISO_8859_1_foo.c files in src/backend/snowball/libstemmer/ covers a lot more languages than I think LATIN1 is meant to cope with. I'm not sure how much of this is broken in the original Snowball code and how much is our error while importing the code.

regards, tom lane
I wrote:
>> [ we seem to have gotten a misencoded version of hungarian.stop ]

> Actually, it looks like things are even worse than that: the Hungarian
> stemmer code seems to be confused about this too.  In the first place,
> we've got a LATIN1 version of that stemmer, which I would imagine is
> entirely useless; and in the second place, the UTF8 version has no
> reference to any non-LATIN1 characters.

> Again, I'm suspecting this problem goes further than Hungarian,
> because the set of stem_ISO_8859_1_foo.c files in
> src/backend/snowball/libstemmer/ covers a lot more languages than
> I think LATIN1 is meant to cope with.

After further analysis, it appears that:

1. The cause of the immediately complained-of problem is that we took the stopword file we got from the Snowball website to be in LATIN1, whereas it evidently was meant to be in LATIN2. The problematic characters were code 0xF5 in the file, which we translated to U+00F5, but the correct translation is U+0151. (There is another discrepancy between LATIN1 and LATIN2 at code point 0xFB, but by chance there are none of those in the stopword file.)

2. The Snowball people were just as confused as we were about the appropriate encoding to use for Hungarian: their code claims that the Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII character codes used in it:

    /* special characters (in ISO Latin I) */

    stringdef a'   hex 'E1'  // a-acute
    stringdef e'   hex 'E9'  // e-acute
    stringdef i'   hex 'ED'  // i-acute
    stringdef o'   hex 'F3'  // o-acute
    stringdef o"   hex 'F6'  // o-umlaut
    stringdef oq   hex 'F5'  // o-double-acute
    stringdef u'   hex 'FA'  // u-acute
    stringdef u"   hex 'FC'  // u-umlaut
    stringdef uq   hex 'FB'  // u-double-acute

Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute and u-double-acute don't appear in LATIN1 at all, and the codes shown here are really for LATIN2.

I've reported this issue upstream and there are fixes pending.

3. While I was concerned that there might be similar bugs in the other Snowball stemmers, it appears after a bit of research that LATIN1 is commonly used as an encoding for all the other languages the Snowball code claims it can be used for, even though in a few cases there are seldom-used characters that LATIN1 can't represent. So there's not a clear reason to think there are any other undetected problems (and I would certainly not be the man to find them if they exist).

I've gone ahead and committed the encoding fix for hungarian.stop in all active branches. I'm going to wait for Snowball upstream to accept the proposed patches before I think about incorporating the code changes.

I'm not real sure whether we should consider back-patching those changes. Right now, the Hungarian stemmer is applying rules meant for o-double-acute to o-tilde, which probably means that those stemming rules don't fire at all on actual Hungarian text. If we fix that then the stemmer will behave differently, which might not be all that desirable to change in a minor release. Perhaps we should only make the code changes in HEAD and 9.4?

regards, tom lane
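[Editor's note: the committed stopword fix amounts to reinterpreting the mis-decoded characters. A hedged sketch of the round-trip repair in Python (the helper name is invented for illustration); note that it also covers the 0xFB discrepancy mentioned in point 1:]

```python
def fix_misdecoded(text: str) -> str:
    """Undo a LATIN1 misreading of LATIN2 bytes: re-encode the text
    back to its original bytes, then decode with the right codec."""
    return text.encode('iso8859_1').decode('iso8859_2')

# The seven bad stopwords contained o-tilde where o-double-acute belongs:
assert fix_misdecoded('\u00f5') == '\u0151'   # õ -> ő
# The 0xFB discrepancy would map u-circumflex to u-double-acute:
assert fix_misdecoded('\u00fb') == '\u0171'   # û -> ű
# Characters shared by the two encodings pass through unchanged:
assert fix_misdecoded('magával') == 'magával'
```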
On 11/06/14 15:09, Tom Lane wrote:
> [ analysis of the hungarian.stop LATIN1/LATIN2 mix-up snipped ]
> Perhaps we should only make the code changes
> in HEAD and 9.4?
>
> regards, tom lane

Not saying there is any problem, but you might like to check how the EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € euro sign (HTML: &euro; &#x20AC;)

Cheers,
Gavin
Gavin Flower wrote:
> Not saying there is any problem, but you might like to check how the
> EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):
>
> https://en.wikipedia.org/wiki/Euro_sign
> U+20AC € euro sign
> (HTML: &euro; &#x20AC;)

Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented IIUC.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Gavin Flower wrote:
>> Not saying there is any problem, but you might like to check how the
>> EUR currency symbol is handled (it is in LATIN2, but not in LATIN1):

> Latin1 doesn't have euro, which is why Latin9 (iso-8859-15) was invented
> IIUC.

Yeah, I doubt there's much to be learned from the euro-sign case. The Snowball stemmers certainly don't care about euro --- they only work with alphabetic characters.

Actually, an interesting point is that we could probably use one of the single-byte-encoding LATIN1 stemmers when the database encoding is LATIN9, and thereby save a translation to UTF8 and back, since the stemmer logic isn't going to care about euro signs. Likewise for LATIN2 vs LATIN10. Not sure it's worth the trouble though.

regards, tom lane
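[Editor's note: the LATIN9-reuse idea rests on how few code points the two encodings disagree on. A quick illustrative check in Python:]

```python
# Compare the LATIN1 and LATIN9 decoding of every possible byte value.
diffs = [b for b in range(256)
         if bytes([b]).decode('iso8859_1') != bytes([b]).decode('iso8859_15')]

# Only eight positions differ, and in LATIN1 all eight are symbols
# (currency sign, broken bar, diaeresis, fractions, etc.), which
# alphabetic-only stemmer rules would never reference.
assert len(diffs) == 8
assert diffs == [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]
```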
On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:
> I'm not real sure whether we should consider back-patching those changes.
> Right now, the Hungarian stemmer is applying rules meant for
> o-double-acute to o-tilde, which probably means that those stemming rules
> don't fire at all on actual Hungarian text.  If we fix that then the
> stemmer will behave differently, which might not be all that desirable to
> change in a minor release.  Perhaps we should only make the code changes
> in HEAD and 9.4?

Does this affect any tsvectors stored in earlier major releases that would read differently after this patch? Does it cause a pg_upgrade problem?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Jun 10, 2014 at 11:09:22PM -0400, Tom Lane wrote:
>> I'm not real sure whether we should consider back-patching those changes.
>> Right now, the Hungarian stemmer is applying rules meant for
>> o-double-acute to o-tilde, which probably means that those stemming rules
>> don't fire at all on actual Hungarian text.

> Does this affect any tsvectors stored in earlier major releases that
> would read differently after this patch?  Does it cause a pg_upgrade
> problem?

My guess is the field usage of the Hungarian stemmer is near zero, or somebody would've complained about this before. Hence, I'm not thinking we should expend any huge effort to work around problems.

In any case, Oleg and Teodor have opined in the past that small changes in dictionary behavior don't cause major practical problems; the worst case is that some words aren't found by searches because the current dictionary normalizes them differently than what's in the index. You can get around that if you have to by entering the tsquery manually rather than going through to_tsquery.

regards, tom lane