Thread: BUG #3730: Creating a swedish dictionary fails

BUG #3730: Creating a swedish dictionary fails

From
"Penty Wenngren"
Date:
The following bug has been logged online:

Bug reference:      3730
Logged by:          Penty Wenngren
Email address:      penty.wenngren@dgc.se
PostgreSQL version: 8.3 BETA 2
Operating system:   FreeBSD 7 BETA 2
Description:        Creating a swedish dictionary fails
Details:

I'm trying to create a swedish dictionary for tsearch as specified in the
8.3 manual, but it breaks:

test=# CREATE TEXT SEARCH DICTIONARY swedish_ispell (
test(# TEMPLATE = ispell,
test(# DictFile = swedish,
test(# AffFile = swedish,
test(# StopWords = swedish);
FEL:  syntax error at line 219 of affix file
"/usr/local/share/postgresql/tsearch_data/swedish.affix"

picard# pwd
/usr/local/share/postgresql/tsearch_data

picard# file swedish.*
swedish.affix:        UTF-8 Unicode text
swedish.dict:         UTF-8 Unicode text
swedish.stop:         UTF-8 Unicode text

Line 219 in the affix file looks like this:

        O T     >       -OT,\xc3\x96TTER
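
For readers following along: the escaped bytes on that line are just the UTF-8 encoding of Ö, so the rule reads -OT,ÖTTER. A minimal Python sketch to confirm this, where the variable names are illustrative and not taken from the thread:

```python
# Bytes of the rule's replacement part as shown in the report: "\xc3\x96TTER".
raw = b"\xc3\x96TTER"

# Decode as UTF-8; the two bytes 0xC3 0x96 form a single character.
text = raw.decode("utf-8")

print(text)        # the decoded replacement string
print(len(raw))    # number of bytes in the rule fragment
print(len(text))   # number of characters after decoding
```

The byte count and the character count differ by one, which is exactly the distinction any code handling this line has to keep straight.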


Please forgive me if this is a known problem with a known solution. I
haven't been able to find information about this particular issue regarding
swedish dictionaries.

// Penty

Re: BUG #3730: Creating a swedish dictionary fails

From
Tom Lane
Date:
"Penty Wenngren" <penty.wenngren@dgc.se> writes:
> I'm trying to create a swedish dictionary for tsearch as specified in the
> 8.3 manual, but it breaks:

Can you point us to copies of the swedish files you used?

            regards, tom lane

Re: BUG #3730: Creating a swedish dictionary fails

From
Penty Wenngren
Date:
On Thu, Nov 08, 2007 at 11:42:55AM -0500, Tom Lane wrote:
> "Penty Wenngren" <penty.wenngren@dgc.se> writes:
> > I'm trying to create a swedish dictionary for tsearch as specified in the
> > 8.3 manual, but it breaks:
>
> Can you point us to copies of the swedish files you used?
>

I just realized I only replied to Tom, so here goes again.

I used iconv to convert svenska.aff and svenska.datalist (from
iswedish-1.2.1) to UTF-8. The converted files can be found at:

http://www.lederhosen.org/swedish.affix
http://www.lederhosen.org/swedish.dict

// Penty

--

Penty Wenngren
DGC Solutions AB

Re: BUG #3730: Creating a swedish dictionary fails

From
Tom Lane
Date:
Penty Wenngren <penty.wenngren@dgc.se> writes:
> I used iconv to convert svenska.aff and svenska.datalist (from
> iswedish-1.2.1) to UTF-8. The converted files can be found at:
> http://www.lederhosen.org/swedish.affix
> http://www.lederhosen.org/swedish.dict

I think the reason it's failing right there is that that line is the
first affix rule containing a non-ASCII letter, and the rules are
supposed to only contain letters and certain specific punctuation.
I suspect you are working in a locale that doesn't think Ö is a
letter --- check lc_ctype.

            regards, tom lane

Re: BUG #3730: Creating a swedish dictionary fails

From
Tom Lane
Date:
Penty Wenngren <penty.wenngren@dgc.se> writes:
> On Thu, Nov 08, 2007 at 05:21:17PM -0500, Tom Lane wrote:
>> I suspect you are working in a locale that doesn't think Ö is a
>> letter --- check lc_ctype.

> It doesn't seem to make any difference. The first try was done from a
> terminal that didn't care much for UTF-8, but that is fixed now and I
> still get the same result.

It sorta looks to me like you only changed the locale of your terminal
session.  Changing the database's locale requires re-initdb.  What
does "show lc_ctype" show within Postgres?

            regards, tom lane

Re: BUG #3730: Creating a swedish dictionary fails

From
Penty Wenngren
Date:
On Thu, Nov 08, 2007 at 05:21:17PM -0500, Tom Lane wrote:
> Penty Wenngren <penty.wenngren@dgc.se> writes:
> > I used iconv to convert svenska.aff and svenska.datalist (from
> > iswedish-1.2.1) to UTF-8. The converted files can be found at:
> > http://www.lederhosen.org/swedish.affix
> > http://www.lederhosen.org/swedish.dict
>
> I think the reason it's failing right there is that that line is the
> first affix rule containing a non-ASCII letter, and the rules are
> supposed to only contain letters and certain specific punctuation.
> I suspect you are working in a locale that doesn't think Ö is a
> letter --- check lc_ctype.
>

It doesn't seem to make any difference. The first try was done from a
terminal that didn't care much for UTF-8, but that is fixed now and I
still get the same result. Could it be that iconv's conversion is
broken then, or that I did something terribly wrong in the conversion
process (iconv -f ISO-8859-1 -t UTF-8 svenska.aff > swedish.affix)?
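
As a sanity check on the conversion step, the same ISO-8859-1 to UTF-8 transcoding can be sketched in Python. This only illustrates what the iconv command above is expected to do, using a one-line sample modeled on line 219 in place of the real svenska.aff; the helper name latin1_to_utf8 is made up for this example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8, as iconv -f ISO-8859-1 -t UTF-8 would."""
    return data.decode("iso-8859-1").encode("utf-8")

# In ISO-8859-1 the letter Ö is the single byte 0xD6 (sample line is an assumption).
sample = b"O T     >       -OT,\xd6TTER\n"
converted = latin1_to_utf8(sample)

# After conversion Ö should appear as the two-byte sequence 0xC3 0x96.
print(converted)
```

If the converted file shows 0xC3 0x96 where the original had 0xD6, the conversion did what was intended and the file itself is probably not at fault.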

$ echo $LANG
sv_SE.UTF-8

$ echo $LC_CTYPE
sv_SE.UTF-8

$ psql test
Välkommen till psql 8.3beta2, den interaktiva PostgreSQL-terminalen.

Skriv:  \copyright för upphovsrättsinformation
        \h för hjälp om SQL-kommandon
        \? för hjälp om psql-kommandon
        \g eller avsluta med semikolon för att köra en fråga
        \q för att avsluta

test=# CREATE TEXT SEARCH DICTIONARY swedish_ispell (
TEMPLATE = ispell,
DictFile = swedish,
AffFile = swedish,
StopWords = swedish);
FEL:  syntax error at line 219 of affix file
"/usr/local/share/postgresql/tsearch_data/swedish.affix"


I also tried to convert the file again, this time from a terminal that
likes UTF8 thinking that might have an effect, but the affix file looks
the same.

I found a post in the archives regarding a similar problem:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00825.php

It seems editing the affix file and manually removing some lines at
least partially solved the problem in that case.

// Penty

--

Penty Wenngren
DGC Solutions AB

Re: BUG #3730: Creating a swedish dictionary fails

From
Penty Wenngren
Date:
On Thu, Nov 08, 2007 at 08:45:32PM -0500, Tom Lane wrote:
> Penty Wenngren <penty.wenngren@dgc.se> writes:
> > On Thu, Nov 08, 2007 at 05:21:17PM -0500, Tom Lane wrote:
> >> I suspect you are working in a locale that doesn't think Ö is a
> >> letter --- check lc_ctype.
>
> > It doesn't seem to make any difference. The first try was done from a
> > terminal that didn't care much for UTF-8, but that is fixed now and I
> > still get the same result.
>
> It sorta looks to me like you only changed the locale of your terminal
> session.  Changing the database's locale requires re-initdb.  What
> does "show lc_ctype" show within Postgres?
>

test=# show lc_ctype;
  lc_ctype
-------------
 sv_SE.UTF-8
(1 rad)

The database was initialized with encoding set to UTF8 and locale set to
sv_SE.UTF8.

If you are confident that this is not a PostgreSQL bug, I'll accept that
happily and move on to do some more testing on my end.

// Penty

--

Penty Wenngren
DGC Solutions AB

Re: BUG #3730: Creating a swedish dictionary fails

From
Alvaro Herrera
Date:
Penty Wenngren wrote:

> If you are confident that this is not a PostgreSQL bug, I'll accept that
> happily and move on to do some more testing on my end.

I don't think it has been shown that this is not our bug.  I reproduced
your problem here on an UTF8 environment and it does fail for me.
On the other hand, I am unsure how to test whether Ö is letter here, but
at least lower() works on it:

alvherre=# select lower('Ö');
 lower
-------
 ö
(1 fila)

alvherre=# show lc_ctype ;
  lc_ctype
------------
 sv_SE.utf8
(1 fila)

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
Management by consensus: I have decided; you concede.
(Leonard Liu)

Re: BUG #3730: Creating a swedish dictionary fails

From
Magnus Hagander
Date:
On Fri, 2007-11-09 at 10:27 -0300, Alvaro Herrera wrote:
> Penty Wenngren wrote:
>
> > If you are confident that this is not a PostgreSQL bug, I'll accept that
> > happily and move on to do some more testing on my end.
>
> I don't think it has been shown that this is not our bug.  I reproduced
> your problem here on an UTF8 environment and it does fail for me.
> On the other hand, I am unsure how to test whether Ö is letter here, but
> at least lower() works on it:

Not sure exactly what you mean, but Ö certainly is a letter, and ö is
the lowercase of it, so that part looks correct.

//Magnus

Re: BUG #3730: Creating a swedish dictionary fails

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Penty Wenngren <penty.wenngren@dgc.se> writes:
> > I used iconv to convert svenska.aff and svenska.datalist (from
> > iswedish-1.2.1) to UTF-8. The converted files can be found at:
> > http://www.lederhosen.org/swedish.affix
> > http://www.lederhosen.org/swedish.dict
>
> I think the reason it's failing right there is that that line is the
> first affix rule containing a non-ASCII letter, and the rules are
> supposed to only contain letters and certain specific punctuation.
> I suspect you are working in a locale that doesn't think Ö is a
> letter --- check lc_ctype.

I patched parse_affentry to report the current token and I see this:

alvherre=# CREATE TEXT SEARCH DICTIONARY swedish_ispell (
TEMPLATE = ispell,
DictFile = swedish,
AffFile = swedish,
StopWords = swedish);
ERROR:  syntax error at line 149 (str: "örs
") of affix file "/home/alvherre/Code/CVS/pgsql/install/00orig/share/tsearch_data/swedish.affix"

I am wondering if the newline being included in the token could be
causing a problem.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #3730: Creating a swedish dictionary fails

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I am wondering if the newline being included in the token could be
> causing a problem.

Nope.  I traced through it and the problem is that char2wchar() is
completely brain-dead: at some places it thinks that "len" is the
length of the output wchar array, and at others it thinks that "len"
is the number of bytes in the input.  In particular, _t_isalpha()
fails completely for any multibyte character, because the pnstrdup
call truncates the character to 1 byte.

After looking at the callers I'm inclined to think that the only
safe way to implement this routine is to change its API to provide
both counts.  Comments?
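
To make the failure mode concrete, here is a small Python sketch of the idea. It is not the PostgreSQL code: is_alpha_truncated and is_alpha_mb are hypothetical stand-ins that only mimic the behaviour described above, where a letter test that copies a single byte of a multibyte character can never succeed, while a test that is handed the whole character can. Python's str.isalpha() stands in for the locale-aware wide-character test, which is an assumption of this sketch.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def is_alpha_truncated(buf: bytes) -> bool:
    """Mimics the bug: only the first byte is handed to the converter."""
    try:
        return buf[:1].decode("utf-8").isalpha()
    except UnicodeDecodeError:
        return False  # a lone lead byte is not a valid character

def is_alpha_mb(buf: bytes) -> bool:
    """Hands over every byte of the first character before testing it."""
    n = utf8_char_len(buf[0])
    return buf[:n].decode("utf-8").isalpha()

rule = "ÖTTER".encode("utf-8")
print(is_alpha_truncated(b"OT"), is_alpha_truncated(rule), is_alpha_mb(rule))
```

Plain ASCII rules pass either way, which would explain why the affix file parses until the first rule containing a non-ASCII letter. Passing both the input byte count and the output array size, as proposed, removes the ambiguity about which length is meant.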

            regards, tom lane

Re: BUG #3730: Creating a swedish dictionary fails

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I am wondering if the newline being included in the token could be
> > causing a problem.
>
> Nope.  I traced through it and the problem is that char2wchar() is
> completely brain-dead: at some places it thinks that "len" is the
> length of the output wchar array, and at others it thinks that "len"
> is the number of bytes in the input.  In particular, _t_isalpha()
> fails completely for any multibyte character, because the pnstrdup
> call truncates the character to 1 byte.

Ah, that explains it.  I was reading that code too and did not
understand what was going on.

> After looking at the callers I'm inclined to think that the only
> safe way to implement this routine is to change its API to provide
> both counts.  Comments?

+1

--
Alvaro Herrera                         http://www.flickr.com/photos/alvherre/
Licensee shall have no right to use the Licensed Software
for productive or commercial use. (Licencia de StarOffice 6.0 beta)

Re: BUG #3730: Creating a swedish dictionary fails

From
Alvaro Herrera
Date:
Magnus Hagander wrote:
>
> On Fri, 2007-11-09 at 10:27 -0300, Alvaro Herrera wrote:
> > Penty Wenngren wrote:
> >
> > > If you are confident that this is not a PostgreSQL bug, I'll accept that
> > > happily and move on to do some more testing on my end.
> >
> > I don't think it has been shown that this is not our bug.  I reproduced
> > your problem here on an UTF8 environment and it does fail for me.
> > On the other hand, I am unsure how to test whether Ö is letter here, but
> > at least lower() works on it:
>
> Not sure exactly what you mean, but Ö certainly is a letter, and ö is
> the lowercase of it, so that part looks correct.

Of course, the point is knowing whether the server believes that to be
the case :-)

--
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

Re: BUG #3730: Creating a swedish dictionary fails

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> After looking at the callers I'm inclined to think that the only
>> safe way to implement this routine is to change its API to provide
>> both counts.  Comments?

> +1

I've cleaned this up along with a fair amount of other infelicity in
ts_locale.h/.c.  However, I'm not in a position to test the Windows-
specific bits in wchar2char() and char2wchar() --- could someone
eyeball and/or test what I did?

            regards, tom lane

Re: BUG #3730: Creating a swedish dictionary fails

From
Penty Wenngren
Date:
On Fri, Nov 09, 2007 at 05:39:58PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> After looking at the callers I'm inclined to think that the only
> >> safe way to implement this routine is to change its API to provide
> >> both counts.  Comments?
>
> > +1
>
> I've cleaned this up along with a fair amount of other infelicity in
> ts_locale.h/.c.  However, I'm not in a position to test the Windows-
> specific bits in wchar2char() and char2wchar() --- could someone
> eyeball and/or test what I did?
>

I just tried to create the dictionary again with the snapshot from
November 12th, and it works. Nicely done, thanks :)

// Penty

--

Penty Wenngren
DGC Solutions AB