Re: Tsearch2 and Unicode?

From Markus Wollny
Subject Re: Tsearch2 and Unicode?
Date
Msg-id 2266D0630E43BB4290742247C8910575068B75A3@dozer.computec.de
In response to Tsearch2 and Unicode?  (Dawid Kuroczko <qnex42@gmail.com>)
Responses Re: Tsearch2 and Unicode?  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-general
Hi!

I dug through my list archives - I actually used to have the very same problem you describe: special characters being
swallowed by the tsearch2 functions. The source of the problem was that I had initdb'ed my cluster with DE@euro as locale,
whereas my databases used Unicode encoding. This combination does not work correctly. I had to dump, initdb with the correct
UTF-8 locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion
here: http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
If you wish to find out which locale was used during initdb for your cluster, you may use the pg_controldata program
that's supplied with PostgreSQL.
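A minimal sketch of the check-and-rebuild procedure described above. The data directory path, dump file name, and log file are assumptions; adapt them to your installation:

```shell
# Inspect the locale the cluster was initdb'ed with
# (the data directory path is an assumption)
pg_controldata /var/lib/pgsql/data | grep -i "LC_"

# If LC_COLLATE/LC_CTYPE do not match the database encoding
# (e.g. a latin1-style locale with UNICODE databases),
# dump everything, re-init with a UTF-8 locale, and reload:
pg_dumpall > /tmp/cluster.sql                    # dump while the old cluster still runs
pg_ctl -D /var/lib/pgsql/data stop               # stop the server
mv /var/lib/pgsql/data /var/lib/pgsql/data.old   # keep the old cluster as a backup
initdb --locale=de_DE.UTF-8 -D /var/lib/pgsql/data
pg_ctl -D /var/lib/pgsql/data -l /tmp/pg.log start
psql -f /tmp/cluster.sql template1               # reload the dump
```

Keeping the old data directory around until the reloaded cluster has been verified is cheap insurance against a failed restore.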

Kind regards

   Markus



> -----Ursprüngliche Nachricht-----
> Von: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] Im Auftrag von
> Dawid Kuroczko
> Gesendet: Mittwoch, 17. November 2004 17:17
> An: Pgsql General
> Betreff: [GENERAL] Tsearch2 and Unicode?
>
> I'm trying to use tsearch2 with database which is in
> 'UNICODE' encoding.
> It works fine for English text, but as I intend to search
> Polish texts I did:
>
> insert into pg_ts_cfg values ('default_polish', 'default',
> 'pl_PL.UTF-8'); (and I updated the other pg_ts_* tables as
> described in the manual).
>
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from
> messages; results in list of words but with national chars stripped...
>
> I wonder, am I doing something wrong, or just tsearch2
> doesn't grok Unicode, despite the locales setting?  This also
> is a good question regarding ispell_dict and its feelings
> regarding Unicode, but that's another story.
>
> Assuming Unicode unsupported means I should perhaps... oh,
> convert the data to iso8859 prior feeding it to_tsvector()...
>  interesting idea, but so far I have failed to actually do
> it.  Maybe store the data as 'bytea' and add a column with
> encoding information (assuming I don't want to recreate whole
> database with new encoding, and that I want to use unicode
> for some columns (so I don't have to keep encoding with every
> text everywhere...).
>
> And while we are at it, how do you feel -- an extra column
> with tsvector and its index -- would it be OK to keep it away
> from my data (so I can safely get rid of them if need be)?
> [ I intend to keep index of around 2 000 000 records, few KBs
> of text each ]...
>
>   Regards,
>       Dawid Kuroczko
>
