Re: Tsearch2 and Unicode?

From: Oleg Bartunov
Subject: Re: Tsearch2 and Unicode?
Date:
Msg-id: Pine.GSO.4.61.0411221645540.24069@ra.sai.msu.su
In reply to: Re: Tsearch2 and Unicode?  ("Markus Wollny" <Markus.Wollny@computec.de>)
List: pgsql-general

Markus,

it'd be nice if you (or somebody) wrote a note about Unicode, so it
could be added to the tsearch2 documentation. It will help people and save
time and hair :)


     Oleg
On Mon, 22 Nov 2004, Markus Wollny wrote:

> Hi!
>
> I dug through my list archives - I actually used to have the very same
> problem that you described: special chars being swallowed by
> tsearch2 functions. The source of the problem was that I had INITDB'ed my
> cluster with DE@euro as locale, whereas my databases used Unicode encoding.
> This does not work correctly. I had to dump, initdb to the correct UTF-8
> locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work
> correctly. You may find the original discussion here:
> http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
> If you wish to find out which locale was used during INITDB for your
> cluster, you may use the pg_controldata program that's supplied with
> PostgreSQL.
>
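As a minimal sketch of the check Markus describes, the shell helpers below pull the LC_CTYPE line out of pg_controldata output and test whether the locale name ends in a UTF-8 codeset. The function names cluster_ctype and is_utf8_locale are hypothetical, and the sketch assumes a PostgreSQL release of that era, where pg_controldata reports cluster-wide LC_COLLATE and LC_CTYPE lines.

```shell
#!/bin/sh
# Hypothetical helper: read pg_controldata output on stdin and print
# the value of the LC_CTYPE line (assumed format: "LC_CTYPE:   <locale>").
cluster_ctype() {
    awk -F': *' '/^LC_CTYPE/ { print $2 }'
}

# Hypothetical helper: succeed when the locale name ends in a UTF-8
# codeset such as de_DE.UTF-8 or pl_PL.utf8, fail otherwise.
is_utf8_locale() {
    case "$1" in
        *.[Uu][Tt][Ff]-8|*.[Uu][Tt][Ff]8) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, assuming PGDATA points at the cluster's data directory:
#   ctype=$(pg_controldata "$PGDATA" | cluster_ctype)
#   is_utf8_locale "$ctype" || echo "cluster locale '$ctype' is not UTF-8"
```

If the check reports a non-UTF-8 locale while the databases use Unicode encoding, the remedy described above is a dump, an initdb with the matching UTF-8 locale, and a reload.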
> Kind regards
>
>   Markus
>
>
>
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org
>> [mailto:pgsql-general-owner@postgresql.org] On behalf of
>> Dawid Kuroczko
>> Sent: Wednesday, 17 November 2004 17:17
>> To: Pgsql General
>> Subject: [GENERAL] Tsearch2 and Unicode?
>>
>> I'm trying to use tsearch2 with a database which is in
>> 'UNICODE' encoding.
>> It works fine for English text, but as I intend to search
>> Polish texts I did:
>>
>> insert into pg_ts_cfg values ('default_polish', 'default',
>> 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as
>> written in the manual).
>>
>> However, Polish-specific chars are being eaten alive, it seems.
>> I.e. doing select to_tsvector('default_polish', body) from
>> messages; results in a list of words, but with national chars stripped...
>>
>> I wonder, am I doing something wrong, or just tsearch2
>> doesn't grok Unicode, despite the locales setting?  This also
>> is a good question regarding ispell_dict and its feelings
>> regarding Unicode, but that's another story.
>>
>> Assuming Unicode is unsupported means I should perhaps... oh,
>> convert the data to iso8859 prior to feeding it to to_tsvector()...
>> an interesting idea, but so far I have failed to actually do
>> it.  Maybe store the data as 'bytea' and add a column with
>> encoding information (assuming I don't want to recreate the whole
>> database with a new encoding, and that I want to use unicode
>> for some columns, so I don't have to keep the encoding with every
>> text everywhere)...
>>
>> And while we are at it, how do you feel -- an extra column
>> with tsvector and its index -- would it be OK to keep it away
>> from my data (so I can safely get rid of them if need be)?
>> [ I intend to keep an index of around 2 000 000 records, a few KB
>> of text each ]...
>>
>>   Regards,
>>       Dawid Kuroczko

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83