Обсуждение: FTS with more than one language in body and with unknown query language?

Поиск
Список
Период
Сортировка

FTS with more than one language in body and with unknown query language?

От
Stefan Keller
Дата:
Hi,

I have a text corpus which contains either German or English docs and
I expect queries where I don't know if it's German or English. So I'd
like e.g. that a query "forest" matches "forest" in body_en but also
"Wald" in body_de.

I created a table with attributes body_en and body_de (type "text"). I
will use ts_vector/ts_query on the fly (don't need yet an index
(attributes)).

* Can FTS handle this multilingual situation?
* How to setup a text search configuration which e.g. stems en and de words?
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?
* Any hints to related work where FTS has been used in a multilingual context?

:Stefan


Re: FTS with more than one language in body and with unknown query language?

От
Artur Zakirov
Дата:
Hi,

On 14.07.2016 01:16, Stefan Keller wrote:
> Hi,
>
> I have a text corpus which contains either German or English docs and
> I expect queries where I don't know if it's German or English. So I'd
> like e.g. that a query "forest" matches "forest" in body_en but also
> "Wald" in body_de.
>
> I created a table with attributes body_en and body_de (type "text"). I
> will use ts_vector/ts_query on the fly (don't need yet an index
> (attributes)).
>
> * Can FTS handle this multilingual situation?

In my opinion, PostgreSQL cant handle it. It cant translate words from
one language to another, it just stems word from original form to basic
form. First you need to translate word from English to German, then
search word in the body_de attribute.

And the issue is complicated by the fact that one word could have
different meaning in the other language.

> * How to setup a text search configuration which e.g. stems en and de words?
> * Should I create a synonym dictionary which contains word
> translations en-de instead of synonyms en-en?

This synonym dictionary will contain a thousands entries. So it will
require a great effort to make this dictionary.

> * Any hints to related work where FTS has been used in a multilingual context?
>
> :Stefan
>
>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: FTS with more than one language in body and with unknown query language?

От
Stefan Keller
Дата:
приве́т! Artur

Thanks for your explanations.

2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
> On 14.07.2016 01:16, Stefan Keller wrote:
...
>> * Should I create a synonym dictionary which contains word
>> translations en-de instead of synonyms en-en?
>
> This synonym dictionary will contain a thousands entries. So it will require
> a great effort to make this dictionary.

It's a domain-specific corpus of max. 1000 records of descriptive text
(metadata) about geographic data, like topographic map, land use
planning, etc.

...
>> * How to setup a text search configuration which e.g. stems en and de
>> words?

I still would like to give FTS a try with synonym dictionary (en-de).
Now, I'm wondering how to setup the configuration. I've seen examples
to process either english, german or russian alone. But I did not find
yet any documentation on how to setup the text search configuration
where a corpus contains two (or more) languages at same time in a
table (body_en and body_de).

:Stefan







2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
> Hi,
>
> On 14.07.2016 01:16, Stefan Keller wrote:
>>
>> Hi,
>>
>> I have a text corpus which contains either German or English docs and
>> I expect queries where I don't know if it's German or English. So I'd
>> like e.g. that a query "forest" matches "forest" in body_en but also
>> "Wald" in body_de.
>>
>> I created a table with attributes body_en and body_de (type "text"). I
>> will use ts_vector/ts_query on the fly (don't need yet an index
>> (attributes)).
>>
>> * Can FTS handle this multilingual situation?
>
>
> In my opinion, PostgreSQL cant handle it. It cant translate words from one
> language to another, it just stems word from original form to basic form.
> First you need to translate word from English to German, then search word in
> the body_de attribute.
>
> And the issue is complicated by the fact that one word could have different
> meaning in the other language.
>
>> * How to setup a text search configuration which e.g. stems en and de
>> words?
>> * Should I create a synonym dictionary which contains word
>> translations en-de instead of synonyms en-en?
>
>
> This synonym dictionary will contain a thousands entries. So it will require
> a great effort to make this dictionary.
>
>
>> * Any hints to related work where FTS has been used in a multilingual
>> context?
>>
>> :Stefan
>>
>>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company


Re: FTS with more than one language in body and with unknown query language?

От
Artur Zakirov
Дата:
Hello, Stefan!

On 15.07.2016 01:54, Stefan Keller wrote:
> приве́т! Artur
>
> Thanks for your explanations.
>
> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>> On 14.07.2016 01:16, Stefan Keller wrote:
> ...
>>> * Should I create a synonym dictionary which contains word
>>> translations en-de instead of synonyms en-en?
>>
>> This synonym dictionary will contain a thousands entries. So it will require
>> a great effort to make this dictionary.
>
> It's a domain-specific corpus of max. 1000 records of descriptive text
> (metadata) about geographic data, like topographic map, land use
> planning, etc.

Oh, I understand. If you need word stemming you can use a thesaurus
dictionary. So you should do the following.

1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example
of entry:

wald : forest

2 - PostgreSQL already has english_stem dictionary. In this solution you
do not need german_stem dictionary. Create the dictionary and the
configuration in PostgreSQL:

=# CREATE TEXT SEARCH DICTIONARY geo_ths (
     Template = thesaurus,
     DictFile = geo,
     Dictionary = pg_catalog.english_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
         word, hword, hword_part
     WITH geo_ths, english_stem;

For example, we have the following data:

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

Then you can execute the following queries:

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
to_tsquery('geo_ths', 'forest');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
to_tsquery('geo_ths', 'forest');
  id | body_en | body_de
----+---------+---------
   2 | (null)  | wald
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
to_tsquery('geo_ths', 'wald');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
to_tsquery('geo_ths', 'wald');
  id | body_en | body_de
----+---------+---------
   2 | (null)  | wald
(1 row)

This solution does not take into account the fact that queries in german
can be in various forms (for example, plural forms). It works well for
english. As a solution of this, you can put into geo.ths various forms
of word.

But maybe the solution above is what you need. Otherwise it can serves
as a concept of solution.

>
> ...
>>> * How to setup a text search configuration which e.g. stems en and de
>>> words?
>
> I still would like to give FTS a try with synonym dictionary (en-de).
> Now, I'm wondering how to setup the configuration. I've seen examples
> to process either english, german or russian alone. But I did not find
> yet any documentation on how to setup the text search configuration
> where a corpus contains two (or more) languages at same time in a
> table (body_en and body_de).
>
> :Stefan
>
>
>
>
>
>
>
> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>> Hi,
>>
>> On 14.07.2016 01:16, Stefan Keller wrote:
>>>
>>> Hi,
>>>
>>> I have a text corpus which contains either German or English docs and
>>> I expect queries where I don't know if it's German or English. So I'd
>>> like e.g. that a query "forest" matches "forest" in body_en but also
>>> "Wald" in body_de.
>>>
>>> I created a table with attributes body_en and body_de (type "text"). I
>>> will use ts_vector/ts_query on the fly (don't need yet an index
>>> (attributes)).
>>>
>>> * Can FTS handle this multilingual situation?
>>
>>
>> In my opinion, PostgreSQL cant handle it. It cant translate words from one
>> language to another, it just stems word from original form to basic form.
>> First you need to translate word from English to German, then search word in
>> the body_de attribute.
>>
>> And the issue is complicated by the fact that one word could have different
>> meaning in the other language.
>>
>>> * How to setup a text search configuration which e.g. stems en and de
>>> words?
>>> * Should I create a synonym dictionary which contains word
>>> translations en-de instead of synonyms en-en?
>>
>>
>> This synonym dictionary will contain a thousands entries. So it will require
>> a great effort to make this dictionary.
>>
>>
>>> * Any hints to related work where FTS has been used in a multilingual
>>> context?
>>>
>>> :Stefan
>>>
>>>
>>
>> --
>> Artur Zakirov
>> Postgres Professional: http://www.postgrespro.com
>> Russian Postgres Company


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: FTS with more than one language in body and with unknown query language?

От
Stefan Keller
Дата:
Artur

2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
...
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serve as a
> concept of solution.

Many thanks for this helpful input!
I actually expect that stemming takes place for english and german.
And we will in fact have queries in english and in german as well.
So I think we still have some issues to resolve...?

:Stefan


2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
> Hello, Stefan!
>
> On 15.07.2016 01:54, Stefan Keller wrote:
>>
>> приве́т! Artur
>>
>> Thanks for your explanations.
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:
>>
>> ...
>>>>
>>>> * Should I create a synonym dictionary which contains word
>>>> translations en-de instead of synonyms en-en?
>>>
>>>
>>> This synonym dictionary will contain a thousands entries. So it will
>>> require
>>> a great effort to make this dictionary.
>>
>>
>> It's a domain-specific corpus of max. 1000 records of descriptive text
>> (metadata) about geographic data, like topographic map, land use
>> planning, etc.
>
>
> Oh, I understand. If you need word stemming you can use a thesaurus
> dictionary. So you should do the following.
>
> 1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of
> entry:
>
> wald : forest
>
> 2 - PostgreSQL already has english_stem dictionary. In this solution you do
> not need german_stem dictionary. Create the dictionary and the configuration
> in PostgreSQL:
>
> =# CREATE TEXT SEARCH DICTIONARY geo_ths (
>     Template = thesaurus,
>     DictFile = geo,
>     Dictionary = pg_catalog.english_stem);
> =# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
> =# ALTER TEXT SEARCH CONFIGURATION geo_ths
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part
>     WITH geo_ths, english_stem;
>
> For example, we have the following data:
>
> =# CREATE TABLE geo (id int, body_en text, body_de text);
> =# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');
>
> Then you can execute the following queries:
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> ----+---------+---------
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> ----+---------+---------
>   2 | (null)  | wald
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> ----+---------+---------
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> ----+---------+---------
>   2 | (null)  | wald
> (1 row)
>
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serves as a
> concept of solution.
>
>
>>
>> ...
>>>>
>>>> * How to setup a text search configuration which e.g. stems en and de
>>>> words?
>>
>>
>> I still would like to give FTS a try with synonym dictionary (en-de).
>> Now, I'm wondering how to setup the configuration. I've seen examples
>> to process either english, german or russian alone. But I did not find
>> yet any documentation on how to setup the text search configuration
>> where a corpus contains two (or more) languages at same time in a
>> table (body_en and body_de).
>>
>> :Stefan
>>
>>
>>
>>
>>
>>
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@postgrespro.ru>:
>>>
>>> Hi,
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I have a text corpus which contains either German or English docs and
>>>> I expect queries where I don't know if it's German or English. So I'd
>>>> like e.g. that a query "forest" matches "forest" in body_en but also
>>>> "Wald" in body_de.
>>>>
>>>> I created a table with attributes body_en and body_de (type "text"). I
>>>> will use ts_vector/ts_query on the fly (don't need yet an index
>>>> (attributes)).
>>>>
>>>> * Can FTS handle this multilingual situation?
>>>
>>>
>>>
>>> In my opinion, PostgreSQL cant handle it. It cant translate words from
>>> one
>>> language to another, it just stems word from original form to basic form.
>>> First you need to translate word from English to German, then search word
>>> in
>>> the body_de attribute.
>>>
>>> And the issue is complicated by the fact that one word could have
>>> different
>>> meaning in the other language.
>>>
>>>> * How to setup a text search configuration which e.g. stems en and de
>>>> words?
>>>> * Should I create a synonym dictionary which contains word
>>>> translations en-de instead of synonyms en-en?
>>>
>>>
>>>
>>> This synonym dictionary will contain a thousands entries. So it will
>>> require
>>> a great effort to make this dictionary.
>>>
>>>
>>>> * Any hints to related work where FTS has been used in a multilingual
>>>> context?
>>>>
>>>> :Stefan
>>>>
>>>>
>>>
>>> --
>>> Artur Zakirov
>>> Postgres Professional: http://www.postgrespro.com
>>> Russian Postgres Company
>
>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company


Re: FTS with more than one language in body and with unknown query language?

От
Artur Zakirov
Дата:
On 15.07.2016 21:34, Stefan Keller wrote:
> I actually expect that stemming takes place for english and german.
> And we will in fact have queries in english and in german as well.
> So I think we still have some issues to resolve...?

I performed the following things:
- patch for PostgreSQL:
   https://github.com/select-artur/postgres/tree/join_tsconfig
   It adds new option for FTS dictionary mapping (JOIN). I want to
propose this patch to -hackers.
- dict_translate dictionary based on dict_xsyn contrib:
   https://github.com/select-artur/dict_translate

This things are made for multilingual purpose and are interesting for
us. Maybe they will be helpful for you too.

Example:

1 - Create files:

$SHAREDIR/tsearch_data/geo_en.trn:

forest wald forst holz

$SHAREDIR/tsearch_data/geo_de.trn:

wald forest wood
forst forest wood
holz forest wood

2 - Execute queries:

=# CREATE TEXT SEARCH DICTIONARY geo_en (
     Template = translate,
     DictFile = geo_en,
     InputDict = pg_catalog.english_stem);
=# CREATE TEXT SEARCH DICTIONARY geo_de (
     Template = translate,
     DictFile = geo_de,
     InputDict = pg_catalog.german_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
         word, hword, hword_part
     WITH geo_en (JOIN), english_stem (JOIN), geo_de (JOIN), german_stem
(JOIN);

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

3 - Sample queries:

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@
to_tsquery('geo', 'forests');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_de) @@
to_tsquery('geo', 'forests');
  id | body_en | body_de
----+---------+---------
   2 | (null)  | wald
   3 | (null)  | forst
(2 rows)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@
to_tsquery('geo', 'walde');
  id | body_en | body_de
----+---------+---------
   1 | forest  | (null)
(1 row)

I will be glad for your comments.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company