Discussion: TSearch2 / German compound words / UTF-8

TSearch2 / German compound words / UTF-8

From

Hannes Dorbath

Date:

23 November 2005, 05:56:38

Hi,

I'm on PG 8.0.4, initDB and locale set to de_DE.UTF-8, FreeBSD.

My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus
Wollny (http://tinyurl.com/a6po4).

The following files are used:

http://hannes.imos.net/german.med          [UTF-8]
http://hannes.imos.net/german.aff          [ANSI]
http://hannes.imos.net/german.stop         [UTF-8]
http://hannes.imos.net/german.stop.ispell  [UTF-8]

german.med is from "ispell-german-compound.tar.gz", available on the
TSearch2 site, recoded to UTF-8.

The first problem is with German compound words and has nothing to do
with UTF-8:

In German, an "s" is often used to link two words into a compound
word. This is true for many German compound words. TSearch/ispell is not
able to break those words up; only exact matches work.

An example with "Produktionsintervall" (production interval):

fts=# SELECT ts_debug('Produktionsintervall');
                                              ts_debug
--------------------------------------------------------------------------------------------------
  (default_german,lword,"Latin word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')

Tsearch/ispell is not able to break this word into parts, because of the
"s" in "Produktion/s/intervall". Misspelling the word as
"Produktionintervall" fixes it:

fts=# SELECT ts_debug('Produktionintervall');
                                                       ts_debug
---------------------------------------------------------------------------------------------------------------------
  (default_german,lword,"Latin word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall' 'produktion'")

How can I fix this / get TSearch to remove/stem the last "s" on a word
before (re-)searching the dict? Can I modify my dict or hack something
else? This is a bit of a show stopper :/


The second thing is with UTF-8:

I know there is no support, or no full support, yet, but I need to get it as good
as possible /now/. Is there anything in CVS that I might be able to
backport to my version, or are there other tips? My setup works as far as the dict
and the stop word files are concerned, but I fear the stemming and mapping of umlauts
and other special chars does not work as it should. I tried recoding
german.aff to UTF-8 as well, but that sometimes breaks it with a regex
error:

fts=# SELECT ts_debug('dass');
ERROR:  Regex error in '[^sãŸ]$': brackets [] not balanced
CONTEXT:  SQL function "ts_debug" statement 1

This seems to happen while it tries to map ss to ß, but anyway, I fear I didn't
do anything good with that.
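
One possible explanation for the garbled pattern in the error above, offered as an assumption and not as something confirmed in this thread: if the UTF-8 affix file is lower-cased one byte at a time under a single-byte character set, the first byte of "ß" (0xC3) is turned into 0xE3, and the result is no longer valid UTF-8. A minimal Python sketch of that effect (the helper names are hypothetical):

```python
def bytewise_lower(data: bytes, charset: str = "latin-1") -> bytes:
    """Lower-case data one byte at a time, as a single-byte character set would."""
    return data.decode(charset).lower().encode(charset)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The affix condition as it would appear in a UTF-8 encoded german.aff
# (assumed original form of the pattern shown in the error message).
pattern = "[^sß]$".encode("utf-8")

# Byte-wise lower-casing: 0xC3 is read as the Latin-1 letter "Ã" and becomes "ã" (0xE3).
lowered = bytewise_lower(pattern)

print(pattern)                    # original bytes
print(lowered)                    # bytes after byte-wise lower-casing
print(is_valid_utf8(lowered))     # the lowered pattern is not valid UTF-8
print(lowered.decode("cp1252"))   # how the broken bytes look in a Windows-1252 terminal
```

Rendered as Windows-1252, the broken bytes look like the pattern in the error message, which is at least consistent with this guess.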

As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
snowball dict. The first lines of the stem.h I used start with:

> extern struct SN_env * german_ISO_8859_1_create_env(void);

So I guess this will not work exactly well with UTF-8 ;p Is there any
other stem.h I could use? Google hasn't returned much for me :/


Thanks for reading and for all your time. I'll consider the donate button
after I get this working ;/

--
Regards,
Hannes Dorbath

TSearch2 / UTF-8 and stat() function

From

Hannes Dorbath

Date:

23 November 2005, 06:20:25

Another UTF-8 thing I forgot:

fts=# SELECT * FROM stat('SELECT to_tsvector(''simple'', line) FROM fts;');
ERROR:  invalid byte sequence for encoding "UNICODE": 0xe2a7

The query inside the stat() function works fine on its own. I have not set
any client encoding. What breaks it? It works as long as the inner query
does not return UTF-8 in vectors.

Thanks.

--
Regards,
Hannes Dorbath

Re: TSearch2 / German compound words / UTF-8

From

Teodor Sigaev

Date:

23 November 2005, 06:37:44

> Tsearch/ispell is not able to break this word into parts, because of the
> "s" in "Produktion/s/intervall". Misspelling the word as
> "Produktionintervall" fixes it:

There should be affixes marked as 'affix in the middle of a compound word'.
The flag is '~'; for an example, look in the norsk dictionary:

flag ~\\:
     [^S]           >        S              #~ advarsel > advarsels-

BTW, we developed and debugged compound word support on the norsk (Norwegian) dictionary,
so look there for examples. But we don't know Norwegian; Norwegians helped us :)



>
>
> The second thing is with UTF-8:
>
> I know there is no, or no full support yet, but I need to get it as good
> as it's possible /now/. Is there anything in CVS that I might be able to
> backport to my version or other tips? My setup works, as for the dict
> and the stop word files, but I fear the stemming and mapping of umlauts
> and other special chars does not as it should. I tried recoding the
> german.aff to UTF-8 as well, but that breaks it with an regex error
> sometimes:

What is in CVS now is a deep alpha version, and so far only the text parser is
UTF-compliant; we are continuing development...


>
> fts=# SELECT ts_debug('dass');
> ERROR:  Regex error in '[^sãŸ]$': brackets [] not balanced
> CONTEXT:  SQL function "ts_debug" statement 1
>
> This seems while it tries to map ss to ß, but anyway, I fear, I didn't
> anything good with that.
>
> As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
> snowball dict. The first lines of the stem.h I used start with:
>
>> extern struct SN_env * german_ISO_8859_1_create_env(void);
Can you use ISO-8859-1?

> So I guess this will not work exactly well with UTF-8 ;p Is there any
> other stem.h I could use? Google hasn't returned much for me :/

http://snowball.tartarus.org/

Snowball can generate a UTF parser:
http://snowball.tartarus.org/runtime/use.html:
     F1 [-o[utput] F2]
        [-s[yntax]]
        [-w[idechars]]  [-u[tf8]] <-------- that's it!
        [-j[ava]]  [-n[ame] C]
        [-ep[refix] S1]  [-vp[refix] S2]
        [-i[nclude] D]
        [-r[untime] P]
At least for Russian there are two parsers, for KOI8 and UTF (
http://snowball.tartarus.org/algorithms/russian/stem.sbl
http://snowball.tartarus.org/algorithms/russian/stem-Unicode.sbl
); a diff shows that they differ only in the stringdef section. So you can make a UTF
parser for German.
BUT, I'm afraid that Snowball uses widechar, and postgres uses multibyte for UTF
internally.
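
To illustrate why the wide-character versus multibyte distinction matters for a stemmer, here is a small Python sketch. It says nothing about how Snowball or PostgreSQL are actually implemented; it only shows that the same German word has different lengths depending on the representation, and that cutting a multibyte string at a byte boundary can split a character. The function names are made up for the example.

```python
from typing import Tuple


def lengths(word: str) -> Tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, Latin-1 bytes) for word."""
    return (len(word), len(word.encode("utf-8")), len(word.encode("latin-1")))


def drop_last_byte_is_valid(word: str) -> bool:
    """Remove the last byte of the UTF-8 form and report whether it still decodes."""
    truncated = word.encode("utf-8")[:-1]
    try:
        truncated.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "ö" and "ß" each take one code point but two bytes in UTF-8.
print(lengths("Größe"))

# Stripping one *byte* from a word ending in "ß" cuts the character in half.
print(drop_last_byte_is_valid("Maß"))

# For a word ending in an ASCII letter the same operation is harmless.
print(drop_last_byte_is_valid("Wort"))
```

Code that strips suffixes by counting bytes therefore needs to know which representation it is handed.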



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: TSearch2 / German compound words / UTF-8

From

Oleg Bartunov

Date:

23 November 2005, 07:07:33

On Wed, 23 Nov 2005, Hannes Dorbath wrote:

> Hi,
>
> I'm on PG 8.0.4, initDB and locale set to de_DE.UTF-8, FreeBSD.
>
> My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus Wollny
> (http://tinyurl.com/a6po4).
>
> The following files are used:
>
> http://hannes.imos.net/german.med          [UTF-8]
> http://hannes.imos.net/german.aff          [ANSI]
> http://hannes.imos.net/german.stop         [UTF-8]
> http://hannes.imos.net/german.stop.ispell  [UTF-8]
>
> german.med is from "ispell-german-compound.tar.gz", available on the TSearch2
> site, recoded to UTF-8.
>
> The first problem is with german compound words and does not have to do
> anything with UTF-8:
>
> In german often an "s" is used to "link" two words into an compound word.
> This is true for many german compound words. TSearch/ispell is not able to
> break those words up, only exact matches work.
>
> An example with "Produktionsintervall" (production interval):
>
> fts=# SELECT ts_debug('Produktionsintervall');
>                                             ts_debug
> --------------------------------------------------------------------------------------------------
> (default_german,lword,"Latin
> word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')
>
> Tsearch/ispell is not able to break this word into parts, because of the "s"
> in "Produktion/s/intervall". Misspelling the word as "Produktionintervall"
> fixes it:
>
> fts=# SELECT ts_debug('Produktionintervall');
>                                                      ts_debug
> ---------------------------------------------------------------------------------------------------------------------
> (default_german,lword,"Latin
> word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall'
> 'produktion'")
>
> How can I fix this / get TSearch to remove/stem the last "s" on a word before
> (re-)searching the dict? Can I modify my dict or hack something else? This is
> a bit of a show stopper :/


I think the right way is to fix the affix file, i.e. add an appropriate rule,
but this is beyond our skills :) You should probably send your
complaints/suggestions to the author named in german.aff
("erstellt von transam", email: transam45@gmx.net).

>
>
> The second thing is with UTF-8:
>
> I know there is no, or no full support yet, but I need to get it as good as
> it's possible /now/. Is there anything in CVS that I might be able to
> backport to my version or other tips? My setup works, as for the dict and the
> stop word files, but I fear the stemming and mapping of umlauts and other
> special chars does not as it should. I tried recoding the german.aff to UTF-8
> as well, but that breaks it with an regex error sometimes:
>
> fts=# SELECT ts_debug('dass');
> ERROR:  Regex error in '[^s??]$': brackets [] not balanced
> CONTEXT:  SQL function "ts_debug" statement 1
>
> This seems to happen while it tries to map ss to ß, but anyway, I fear I didn't
> do anything good with that.

A similar problem was discussed at
http://sourceforge.net/mailarchive/forum.php?thread_id=6271285&forum_id=7671


>
> As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
> snowball dict. The first lines of the stem.h I used start with:
>
>> extern struct SN_env * german_ISO_8859_1_create_env(void);
>
> So I guess this will not work exactly well with UTF-8 ;p Is there any other
> stem.h I could use? Google hasn't returned much for me :/
>

As we have mentioned several times, tsearch2 doesn't support UTF-8 and
works only by accident :) We've got a working parser with full UTF-8
support, but we need to rewrite the interfaces to the dictionaries, so there is
nothing useful at the moment. All changes are available in CVS HEAD (8.2dev).

A backpatch for 8.1 will be available from our site as soon as we complete
UTF-8 support for CVS HEAD. We have no deadlines yet, but we have discussed
support for this project with the OpenACS community (a grant from the University of
Mannheim), so it's possible that we could complete it really soon
(we have no answer yet).


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: TSearch2 / German compound words / UTF-8

From

Alexander Presber

Date:

27 January 2006, 10:11:24

>> Tsearch/ispell is not able to break this word into parts, because
>> of the "s" in "Produktion/s/intervall". Misspelling the word as
>> "Produktionintervall" fixes it:
> It should be affixes marked as 'affix in middle of compound word',
> Flag is '~', example look in norsk dictionary:
>
> flag ~\\:
>     [^S]           >        S              #~ advarsel > advarsels-
>
> BTW, we develop and debug compound word support on norsk
> (norwegian) dictionary, so look for example there. But we don't
> know Norwegian, norwegians helped us :)

Hello everyone!

I cannot get this to work, neither in a German version nor with the
Norwegian example supplied on the tsearch website.
That means that, just like Hannes, I can get compound word support only without
the inserted 's', in German and Norwegian:
"Vertragstrafe" works, but not "Vertragsstrafe", which is the correct
form.

So I tried it the other way around: My dictionary consists of two words:

---
vertrag/zs
strafe/z
  ---

My affix file just switches on compounds and allows for s-insertion
as described in the Norwegian tutorial:

---
compoundwords controlled z
suffixes
flag s:
   [^S] > S              # does not end in "s": append "s" and include in compound check ("Recht" > "Rechts-")
---

ts_debug yields:

tstest=# SELECT tsearch2.ts_debug('vertragstrafe strafevertrag vertragsstrafe');
                                       ts_debug
-------------------------------------------------------------------------------------
(german,lword,"Latin word",vertragstrafe,"{ispell_de,simple}","'strafe' 'vertrag'")
(german,lword,"Latin word",strafevertrag,"{ispell_de,simple}","'strafe' 'vertrag'")
(german,lword,"Latin word",vertragsstrafe,"{ispell_de,simple}",'vertragsstrafe')
(3 Zeilen)

I would say the ispell compound support does not honor the s flag in
compounds.
Could it be that this feature got lost in a regression? It must have
worked for Norwegian once. (Take the "overtrekksgrilldresser" example
from the tsearch2:compounds tutorial, which I cannot reproduce.)

Any hints?

Alexander

Re: TSearch2 / German compound words / UTF-8

From

Alexander Presber

Date:

27 January 2006, 10:41:03

I should add that, with the minimal dictionary and .aff file,
"vertrags" gets reduced alright, dropping the trailing 's':

tstest=# SELECT tsearch2.ts_debug('vertrags');
                               ts_debug
---------------------------------------------------------------------
(german,lword,"Latin word",vertrags,"{ispell_de,simple}",'vertrag')
(1 Zeile)

The affix is just not applied while looking for compound words.
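
The difference between the expected and the observed behaviour can be modelled with a toy splitter. This is only a sketch in Python under stated assumptions, not the tsearch2/ispell algorithm: the dictionary mirrors the two-entry test file above, flag "z" is taken to mean "may take part in compounds" and flag "s" to mean "may be followed by a linking s", and the function name is made up for the example.

```python
from typing import Dict, List, Optional, Set

# Toy dictionary mirroring the two-entry test file: vertrag/zs and strafe/z.
DICTIONARY: Dict[str, Set[str]] = {
    "vertrag": {"z", "s"},
    "strafe": {"z"},
}


def split_compound(word: str, allow_linking_s: bool) -> Optional[List[str]]:
    """Return the dictionary stems that make up word, or None if it cannot be split."""
    word = word.lower()
    if not word:
        return []
    for end in range(1, len(word) + 1):
        stem = word[:end]
        flags = DICTIONARY.get(stem)
        if flags is None or "z" not in flags:
            continue
        rest = word[end:]
        candidates = [rest]
        # Consume a linking "s" only in the middle of a compound, and only
        # when the stem carries the "s" flag.
        if allow_linking_s and "s" in flags and rest.startswith("s") and len(rest) > 1:
            candidates.append(rest[1:])
        for candidate in candidates:
            tail = split_compound(candidate, allow_linking_s)
            if tail is not None:
                return [stem] + tail
    return None


# With the affix applied inside compounds, the correct spelling splits.
print(split_compound("vertragsstrafe", allow_linking_s=True))
# Without it, only the "misspelled" form without the linking "s" splits.
print(split_compound("vertragsstrafe", allow_linking_s=False))
print(split_compound("vertragstrafe", allow_linking_s=False))
```

The second call corresponds to what the ts_debug output shows: the word comes back unsplit because no dictionary stem starts at "sstrafe".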

Sincerely yours
Alexander Presber

Re: TSearch2 / German compound words / UTF-8

From

Oleg Bartunov

Date:

27 January 2006, 13:00:59

Alexander,

could you try tsearch2 from CVS HEAD?
tsearch2 in 8.1.X doesn't support UTF-8 and works for some people
only by accident :)

     Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: TSearch2 / German compound words / UTF-8

From

Teodor Sigaev

Date:

27 January 2006, 13:17:36

contrib_regression=# insert into pg_ts_dict values (
          'norwegian_ispell',
           (select dict_init from pg_ts_dict where dict_name='ispell_template'),
           'DictFile="/usr/local/share/ispell/norsk.dict" ,'
           'AffFile ="/usr/local/share/ispell/norsk.aff"',
          (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
          'Norwegian ISpell dictionary'
    );
INSERT 16681 1
contrib_regression=# select lexize('norwegian_ispell','politimester');
                   lexize
------------------------------------------
  {politimester,politi,mester,politi,mest}
(1 row)

contrib_regression=# select lexize('norwegian_ispell','sjokoladefabrikk');
                 lexize
--------------------------------------
  {sjokoladefabrikk,sjokolade,fabrikk}
(1 row)

contrib_regression=# select lexize('norwegian_ispell','overtrekksgrilldresser');
          lexize
-------------------------
  {overtrekk,grill,dress}
(1 row)
% psql -l
            List of databases
         Name        | Owner  | Encoding
--------------------+--------+----------
  contrib_regression | teodor | KOI8
  postgres           | pgsql  | KOI8
  template0          | pgsql  | KOI8
  template1          | pgsql  | KOI8
(4 rows)


I'm afraid that is a UTF-8 problem. We just committed multibyte support
for tsearch2 to CVS HEAD, so you can try it.

Please note that the dict, aff and stopword files should be in the server encoding.
Snowball sources for German (and other languages) in UTF8 can be found at
http://snowball.tartarus.org/dist/libstemmer_c.tgz
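
A minimal sketch of checking and converting a dictionary file's encoding, in Python. The function names and file names are hypothetical, and recoding only helps when the tsearch2 build in use can actually handle the target encoding; earlier in this thread, recoding german.aff to UTF-8 on 8.0.4 led to a regex error.

```python
from pathlib import Path


def is_encoded_as(path: Path, encoding: str) -> bool:
    """Return True if the file decodes cleanly in the given encoding.

    Note: single-byte encodings such as latin-1 accept any byte sequence,
    so this check is only meaningful for encodings like utf-8.
    """
    try:
        path.read_bytes().decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def recode_file(src: Path, dst: Path, src_encoding: str, dst_encoding: str) -> None:
    """Write a copy of src to dst, converted from src_encoding to dst_encoding."""
    text = src.read_bytes().decode(src_encoding)   # fails if src is not in src_encoding
    dst.write_bytes(text.encode(dst_encoding))     # fails if a character has no mapping


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.latin1.stop"     # hypothetical file names
        dst = Path(tmp) / "sample.utf8.stop"
        src.write_bytes("daß\nfür\n".encode("latin-1"))

        print(is_encoded_as(src, "utf-8"))         # Latin-1 bytes are not valid UTF-8 here
        recode_file(src, dst, "latin-1", "utf-8")
        print(is_encoded_as(dst, "utf-8"))
```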

To all: Maybe we should put all of Snowball's stemmers (for all available
languages and encodings) into the tsearch2 directory?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: TSearch2 / German compound words / UTF-8

From

Harald Armin Massa

Date:

27 January 2006, 13:33:21

Teodor,

> To all: May be, we should put all snowball's stemmers (for all available
> languages and encodings) to tsearch2 directory?

Yes, that would be VERY helpful. Up to now I have not dared to use tsearch2 because of all the "get stemmer here, get dictionary there...".

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607

Re: TSearch2 / German compound words / UTF-8

From

Oleg Bartunov

Date:

30 January 2006, 08:11:37

On Fri, 27 Jan 2006, Harald Armin Massa wrote:

> Teodor,
>
>>
>> To all: May be, we should put all snowball's stemmers (for all available
>> languages and encodings) to tsearch2 directory?
>
>
> Yes, that would be VERY helpfull. Up to now I do not dare to use tsearch2
> because "get stemmer here, get dictionary there..."

Hmm, we could provide snowball stemmers tsearch2-ready (about 700kb),
but ispell dictionaries could be very large.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: TSearch2 / German compound words / UTF-8

From

Mike Rylander

Date:

30 January 2006, 22:55:53

On 1/30/06, Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Fri, 27 Jan 2006, Harald Armin Massa wrote:
>
> > Teodor,
> >
> >>
> >> To all: May be, we should put all snowball's stemmers (for all available
> >> languages and encodings) to tsearch2 directory?
> >
> >
> > Yes, that would be VERY helpfull. Up to now I do not dare to use tsearch2
> > because "get stemmer here, get dictionary there..."
>
> Hmm, we could provide snowball stemmers tsearch2-ready (about 700kb),
> but ispell dictionaries could be very large.

I would be willing to host them.  I have plenty of space, and
bandwidth (within reason).



--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: TSearch2 / German compound words / UTF-8

From

Alexander Presber

Date:

17 February 2006, 10:36:56

Hello,

Thanks for your efforts, but I still can't get it to work.
I have now tried the Norwegian example. My encoding is ISO-8859 (I have never
used UTF-8, because I thought it would be slower; the thread name is
a bit misleading).

So I am using an ISO-8859-15 (LATIN9) database:

   ~/cvs/ssd% psql -l

      Name    | Eigentümer | Kodierung
   -----------+------------+-----------
    postgres  | postgres   | LATIN9
    tstest    | aljoscha   | LATIN9

and a Norwegian, ISO-8859 encoded dictionary and aff-file:

   ~% file tsearch/dict/ispell_no/norwegian.dict
   tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
   ~% file tsearch/dict/ispell_no/norwegian.aff
   tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text

the aff-file contains the lines:

   compoundwords controlled z
   ...
   #            to compounds only:
   flag ~\\:
      [^S]    > S

and the dictionary contains:

   overtrekk/BCW\z

   (meaning: word can be compound part, intermediary "s" is allowed)

My configuration is:

   tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
     ts_name  | prs_name |   locale
   -----------+----------+------------
    simple    | default  | de_DE@euro
    german    | default  | de_DE@euro
    norwegian | default  | de_DE@euro


Now the test:

   tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
    lexize
   --------

   (1 Zeile)

BUT:

   tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
                  lexize
   ------------------------------------
    {over,trekk,grill,overtrekk,grill}
   (1 Zeile)


It simply doesn't work. No UTF-8 is involved.

Sincerely yours,

Alexander Presber

P.S.: Henning: Sorry for bothering you with the CC, just ignore it,
if you like.



Re: TSearch2 / German compound words / UTF-8

From

Teodor Sigaev

Date:

17 February 2006, 13:39:58

Very strange...

>   ~% file tsearch/dict/ispell_no/norwegian.dict
>   tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
>   ~% file tsearch/dict/ispell_no/norwegian.aff
>   tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text

Can you put those files somewhere where I can download them (or mail them directly to
me)?


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: TSearch2 / German compound words / UTF-8

From

Teodor Sigaev

Date:

17 February 2006, 13:55:31

BTW, if you took the Norwegian dictionary from
http://folk.uio.no/runekl/dictionary.html, then try building it from the OpenOffice
sources instead (http://lingucomponent.openoffice.org/spell_dic.html, tsearch2/my2ispell).

I found mails in my archive which say that Norwegians prefer the OpenOffice
one.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: TSearch2 / German compound words / UTF-8

From

Oleg Bartunov

Date:

17 February 2006, 14:54:02

Norwegian (Nynorsk and Bokmaal) ispell dictionaries are available from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

I didn't test them.

     Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: TSearch2 / German compound words / UTF-8

From

Teodor Sigaev

Date:

18 February 2006, 06:29:03

Hmm, I have found a small bug:
when there is a compound affix whose search pattern has zero length (which
should not happen!), the ispell dictionary ignores all other compound affixes.
The original affix file contains

flag ~\`:
    E              >       -E,NINGS        #~ avskrive > avskrivnings-
    Z Y Z Y Z Y    >       -ZYZYZY,-       #- flerezyzyzy > fler-

The ZYZYZY rule knocks out the other affixes. That's why my2ispell removes the zyzyzy affix...

I have fixed it in the dictionary code. Try the attached patch; I'll apply it
to CVS on Monday.
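
For an unpatched installation, the workaround mentioned above (dropping the placeholder rule from the affix file) could be sketched as follows. This is an illustrative Python filter, not the actual my2ispell logic; the function name is made up, and the rule format is assumed to be "condition > replacement  # comment" as in the excerpt above.

```python
from typing import Iterable, List


def drop_placeholder_rules(aff_lines: Iterable[str], placeholder: str = "ZYZYZY") -> List[str]:
    """Return the affix lines without rules whose condition is the placeholder."""
    kept = []
    for line in aff_lines:
        rule = line.split("#", 1)[0]               # ignore the trailing comment
        if ">" in rule:
            condition = rule.split(">", 1)[0]      # text left of '>' is the match condition
            # "Z Y Z Y Z Y" and "ZYZYZY" are treated as the same condition.
            if "".join(condition.split()).upper() == placeholder:
                continue
        kept.append(line)
    return kept


sample = [
    r"flag ~\`:",
    "    E              >       -E,NINGS        #~ avskrive > avskrivnings-",
    "    Z Y Z Y Z Y    >       -ZYZYZY,-       #- flerezyzyzy > fler-",
]

for line in drop_placeholder_rules(sample):
    print(line)
```

Only the rule line with the placeholder condition is removed; the flag header and the remaining rule are passed through unchanged.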

Thanks a lot for your persistence.

Attachments

ispell.patch.gz
