Re: FTS with more than one language in body and with unknown query language?

Поиск
Список
Период
Сортировка
От Artur Zakirov
Тема Re: FTS with more than one language in body and with unknown query language?
Дата
Msg-id 55f5ab5c-0678-922f-3852-48901ba4c557@postgrespro.ru
обсуждение исходный текст
Ответ на Re: FTS with more than one language in body and with unknown query language?  (Stefan Keller <sfkeller@gmail.com>)
Ответы Re: FTS with more than one language in body and with unknown query language?  (Stefan Keller <sfkeller@gmail.com>)
Список pgsql-general
Hello, Stefan!

On 15.07.2016 01:54, Stefan Keller wrote:
> приве́т! Artur
>
> Thanks for your explanations.
>
> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>> On 14.07.2016 01:16, Stefan Keller wrote:
> ...
>>> * Should I create a synonym dictionary which contains word
>>> translations en-de instead of synonyms en-en?
>>
>> This synonym dictionary will contain a thousands entries. So it will require
>> a great effort to make this dictionary.
>
> It's a domain-specific corpus of max. 1000 records of descriptive text
> (metadata) about geographic data, like topographic map, land use
> planning, etc.

Oh, I understand. If you need word stemming you can use a thesaurus
dictionary. So you should do the following.

1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example
of entry:

wald : forest

2 - PostgreSQL already has english_stem dictionary. In this solution you
do not need german_stem dictionary. Create the dictionary and the
configuration in PostgreSQL:

=# CREATE TEXT SEARCH DICTIONARY geo_ths (
     Template = thesaurus,
     DictFile = geo,
     Dictionary = pg_catalog.english_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
         word, hword, hword_part
     WITH geo_ths, english_stem;

For example, we have the following data:

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

Then you can execute the following queries:

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
to_tsquery('geo_ths', 'forest');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
to_tsquery('geo_ths', 'forest');
  id | body_en | body_de
----+---------+---------
   2 | (null)  | wald
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
to_tsquery('geo_ths', 'wald');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
to_tsquery('geo_ths', 'wald');
  id | body_en | body_de
----+---------+---------
   2 | (null)  | wald
(1 row)

This solution does not take into account the fact that queries in german
can be in various forms (for example, plural forms). It works well for
english. As a solution of this, you can put into geo.ths various forms
of word.

But maybe the solution above is what you need. Otherwise it can serves
as a concept of solution.

>
> ...
>>> * How to setup a text search configuration which e.g. stems en and de
>>> words?
>
> I still would like to give FTS a try with synonym dictionary (en-de).
> Now, I'm wondering how to setup the configuration. I've seen examples
> to process either english, german or russian alone. But I did not find
> yet any documentation on how to setup the text search configuration
> where a corpus contains two (or more) languages at same time in a
> table (body_en and body_de).
>
> :Stefan
>
>
>
>
>
>
>
> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>> Hi,
>>
>> On 14.07.2016 01:16, Stefan Keller wrote:
>>>
>>> Hi,
>>>
>>> I have a text corpus which contains either German or English docs and
>>> I expect queries where I don't know if it's German or English. So I'd
>>> like e.g. that a query "forest" matches "forest" in body_en but also
>>> "Wald" in body_de.
>>>
>>> I created a table with attributes body_en and body_de (type "text"). I
>>> will use ts_vector/ts_query on the fly (don't need yet an index
>>> (attributes)).
>>>
>>> * Can FTS handle this multilingual situation?
>>
>>
>> In my opinion, PostgreSQL cant handle it. It cant translate words from one
>> language to another, it just stems word from original form to basic form.
>> First you need to translate word from English to German, then search word in
>> the body_de attribute.
>>
>> And the issue is complicated by the fact that one word could have different
>> meaning in the other language.
>>
>>> * How to setup a text search configuration which e.g. stems en and de
>>> words?
>>> * Should I create a synonym dictionary which contains word
>>> translations en-de instead of synonyms en-en?
>>
>>
>> This synonym dictionary will contain a thousands entries. So it will require
>> a great effort to make this dictionary.
>>
>>
>>> * Any hints to related work where FTS has been used in a multilingual
>>> context?
>>>
>>> :Stefan
>>>
>>>
>>
>> --
>> Artur Zakirov
>> Postgres Professional: http://www.postgrespro.com
>> Russian Postgres Company


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


В списке pgsql-general по дате отправления:

Предыдущее
От: Kaixi Luo
Дата:
Сообщение: Slow query on primary server runs fast on hot standby
Следующее
От: Sameer Kumar
Дата:
Сообщение: Re: Slow query on primary server runs fast on hot standby