Re: TSearch2 / German compound words / UTF-8
From        | Alexander Presber
Subject     | Re: TSearch2 / German compound words / UTF-8
Date        |
Msg-id      | 7C945F17-1564-4232-BADE-F61D9D7395F2@weisshuhn.de
In reply to | Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
Responses   | Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
            | Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
            | Re: TSearch2 / German compound words / UTF-8  (Teodor Sigaev <teodor@sigaev.ru>)
List        | pgsql-general
Hello,

Thanks for your efforts; I still can't get it to work. I have now tried the
Norwegian example. My encoding is ISO-8859, not UTF-8 (I never used UTF-8
because I thought it would be slower; the thread name is a bit misleading).
So I am using a LATIN9 (ISO-8859-15) database:

    ~/cvs/ssd% psql -l
       Name    | Eigentümer | Kodierung
    -----------+------------+-----------
     postgres  | postgres   | LATIN9
     tstest    | aljoscha   | LATIN9

and a Norwegian, ISO-8859 encoded dictionary and aff-file:

    ~% file tsearch/dict/ispell_no/norwegian.dict
    tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
    ~% file tsearch/dict/ispell_no/norwegian.aff
    tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text

The aff-file contains the lines:

    compoundwords controlled z
    ...
    # to compounds only:
    flag ~\\:
        [^S] > S

and the dictionary contains:

    overtrekk/BCW\z

(meaning: the word can be a compound part, and an intermediary "s" is
allowed). My configuration is:

    tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
      ts_name  | prs_name |   locale
    -----------+----------+------------
     simple    | default  | de_DE@euro
     german    | default  | de_DE@euro
     norwegian | default  | de_DE@euro

Now the test:

    tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
     lexize
    --------

    (1 Zeile)

BUT:

    tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
                   lexize
    ------------------------------------
     {over,trekk,grill,overtrekk,grill}
    (1 Zeile)

It simply doesn't work, and no UTF-8 is involved.

Sincerely yours,
Alexander Presber

P.S.: Henning: Sorry for bothering you with the CC; just ignore it if you like.
On 27.01.2006 at 18:17, Teodor Sigaev wrote:

> contrib_regression=# insert into pg_ts_dict values (
>     'norwegian_ispell',
>     (select dict_init from pg_ts_dict where dict_name='ispell_template'),
>     'DictFile="/usr/local/share/ispell/norsk.dict" ,'
>     'AffFile ="/usr/local/share/ispell/norsk.aff"',
>     (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
>     'Norwegian ISpell dictionary'
> );
> INSERT 16681 1
> contrib_regression=# select lexize('norwegian_ispell','politimester');
>                   lexize
> ------------------------------------------
>  {politimester,politi,mester,politi,mest}
> (1 row)
>
> contrib_regression=# select lexize('norwegian_ispell','sjokoladefabrikk');
>                lexize
> --------------------------------------
>  {sjokoladefabrikk,sjokolade,fabrikk}
> (1 row)
>
> contrib_regression=# select lexize('norwegian_ispell','overtrekksgrilldresser');
>          lexize
> -------------------------
>  {overtrekk,grill,dress}
> (1 row)
>
> % psql -l
>          List of databases
>         Name        | Owner  | Encoding
> --------------------+--------+----------
>  contrib_regression | teodor | KOI8
>  postgres           | pgsql  | KOI8
>  template0          | pgsql  | KOI8
>  template1          | pgsql  | KOI8
> (4 rows)
>
> I'm afraid that is the UTF-8 problem. We just committed multibyte support
> for tsearch2 to CVS HEAD, so you can try it.
>
> Please note that the dict, aff and stopword files should be in the server
> encoding. Snowball sources for German (and other languages) in UTF-8 can
> be found at http://snowball.tartarus.org/dist/libstemmer_c.tgz
>
> To all: maybe we should put all of Snowball's stemmers (for all available
> languages and encodings) into the tsearch2 directory?
>
> --
> Teodor Sigaev    E-mail: teodor@sigaev.ru
>                  WWW: http://www.sigaev.ru/
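[Editor's note] Teodor's point that the dict/aff files must be in the server
encoding can be sketched as a shell check. This is a hedged example, not from
the thread: the file names and the two-line sample dictionary are hypothetical
stand-ins, and the conversion uses plain iconv.

```shell
# Stand-in for an ISO-8859-15 ispell dictionary file (octal \374 = "ü"):
printf 'overtrekk/BCW\\z\ngr\374nn/W\n' > /tmp/norwegian.latin9.dict

# Inspect the current encoding; expect something like "ISO-8859 text":
file /tmp/norwegian.latin9.dict

# For a UTF8 server, convert the file before pointing DictFile at it;
# a LATIN9 (ISO-8859-15) server can use the original as-is:
iconv -f ISO-8859-15 -t UTF-8 /tmp/norwegian.latin9.dict \
    > /tmp/norwegian.utf8.dict
file /tmp/norwegian.utf8.dict
```

The same conversion would apply to the .aff and stopword files, since all of
them are read in the server encoding.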