Re: Fulltext search configuration

From Oleg Bartunov
Subject Re: Fulltext search configuration
Date
Msg-id Pine.LNX.4.64.0902021829280.4158@sn.sai.msu.ru
In reply to Re: Fulltext search configuration  (Oleg Bartunov <oleg@sai.msu.su>)
Responses Re: Fulltext search configuration  (Mohamed <mohamed5432154321@gmail.com>)
List pgsql-general
On Mon, 2 Feb 2009, Oleg Bartunov wrote:

> On Mon, 2 Feb 2009, Mohamed wrote:
>
>> Hehe, ok..
>> I don't know either but I took some lines from Al-Jazeera :
>> http://aljazeera.net/portal
>>
>> just made the change you said and created it successfully and tried this :
>>
>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ?????????
>> ?????')
>>
>> but I got nothing... :(
>
> Mohamed, what did you expect from ts_lexize? Please give us more useful
> information, otherwise we can't help you.
>
>>
>> Is there a way of making sure that words that are not recognized also get
>> indexed/searched for? (Not that I think this is the problem)
>
> yes

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."

quick example:

CREATE TEXT SEARCH CONFIGURATION arabic (
     COPY = english
);

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
       Token      | Dictionaries
-----------------+--------------
  asciihword      | english_stem
  asciiword       | english_stem
  email           | simple
  file            | simple
  float           | simple
  host            | simple
  hword           | english_stem
  hword_asciipart | english_stem
  hword_numpart   | simple
  hword_part      | english_stem
  int             | simple
  numhword        | simple
  numword         | simple
  sfloat          | simple
  uint            | simple
  url             | simple
  url_path        | simple
  version         | simple
  word            | english_stem

Then you can alter this configuration.
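
As a rough illustration of such an alteration, here is a minimal sketch. It assumes the ispell dictionary from earlier in this thread was created successfully under the name ayaspell; the choice of token types and the fallback to simple are illustrative assumptions, not a tested Arabic setup.

```sql
-- Assumption: a dictionary named ayaspell already exists (the name used in
-- the ts_lexize call quoted above). The default parser reports non-ASCII
-- words as word, hword and hword_part tokens.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ayaspell, simple;

-- Show, token by token, which dictionary recognized what.
-- 'ARABIC TEXT HERE' is a placeholder; substitute a real sentence.
SELECT token, dictionary, lexemes
  FROM ts_debug('arabic', 'ARABIC TEXT HERE');
```

With simple at the end of the list, a word that ayaspell does not recognize is still indexed as is instead of being discarded. Note also that ts_lexize expects a single token, not a whole sentence, so ts_debug is the better tool for checking longer text.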



>
>
>>
>> / Moe
>>
>>
>>
>> On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>>
>>> Mohamed,
>>>
>>> comment out the FLAG line in ar.affix:
>>> #FLAG   long
>>> and creation of the ispell dictionary will work. This is a temporary
>>> solution. Teodor is working on fixing affix auto-recognition.
>>>
>>> I can't say anything about testing, since somebody has to provide a
>>> test case first. I don't know how to type Arabic :)
>>>
>>>
>>> Oleg
>>>
>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>
>>>> Oleg, like I mentioned earlier, I have a different .affix file that I got
>>>> from Andrew with the stop file, and I get no errors creating the
>>>> dictionary using that one, but I get nothing out of ts_lexize.
>>>> The size of that one is: 406,219 bytes
>>>> And the size of the hunspell one (first): 406,229 bytes
>>>>
>>>> A little too close, don't you think?
>>>>
>>>> It might be that the arabic hunspell (ayaspell) affix file is damaged on
>>>> some lines and I got the fixed one from Andrew.
>>>>
>>>> Just wanted to let you know.
>>>>
>>>> / Moe
>>>>
>>>>
>>>>
>>>> On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, thank you Oleg.
>>>>> I have another dictionary package which is a conversion to hunspell
>>>>> as well:
>>>>>
>>>>>
>>>>>
>>>>> http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
>>>>> (Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
>>>>>
>>>>> And running that gives me this error : (again the affix file)
>>>>>
>>>>> ERROR:  wrong affix file format for flag
>>>>> CONTEXT:  line 560 of configuration file
>>>>> "C:/Program Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix":
>>>>> "PFX 1013 Y 6"
>>>>>
>>>>> / Moe
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>>>>>
>>>>>> Mohamed,
>>>>>>
>>>>>> We are looking into the problem.
>>>>>>
>>>>>> Oleg
>>>>>>
>>>>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>>>>
>>>>>>> No, I don't. But ts_lexize doesn't return anything, so I figured there
>>>>>>> must be an error somehow.
>>>>>>> I think we are using the same dictionary, except that I am using the
>>>>>>> stopwords file and a different affix file, because using the hunspell
>>>>>>> (ayaspell) .aff gives me this error:
>>>>>>>
>>>>>>> ERROR:  wrong affix file format for flag
>>>>>>> CONTEXT:  line 42 of configuration file "C:/Program
>>>>>>> Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40
>>>>>>>
>>>>>>> / Moe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 2, 2009 at 12:13 PM, Daniel Chiaramello <
>>>>>>> daniel.chiaramello@golog.net> wrote:
>>>>>>>
>>>>>>>> Hi Mohamed.
>>>>>>>>
>>>>>>>> I don't know where you got the dictionary - I unsuccessfully tried
>>>>>>>> the OpenOffice one myself (the Ayaspell one), and I had no Arabic
>>>>>>>> stopwords file.
>>>>>>>>
>>>>>>>> Renaming the file is supposed to be enough (I did it successfully for
>>>>>>>> the Thai dictionary) - the ".aff" file becoming the ".affix" one.
>>>>>>>> When I tried to create the dictionary:
>>>>>>>>
>>>>>>>> CREATE TEXT SEARCH DICTIONARY ar_ispell (
>>>>>>>>   TEMPLATE = ispell,
>>>>>>>>   DictFile = ar_utf8,
>>>>>>>>   AffFile = ar_utf8,
>>>>>>>>   StopWords = english
>>>>>>>> );
>>>>>>>>
>>>>>>>> I had an error:
>>>>>>>>
>>>>>>>> ERREUR:  mauvais format de fichier affixe pour le drapeau
>>>>>>>> CONTEXTE : ligne 42 du fichier de configuration
>>>>>>>> « /usr/share/pgsql/tsearch_data/ar_utf8.affix » : « PFX Aa      Y 40
>>>>>>>>
>>>>>>>> (which means Bad format of Affix file for flag, line 42 of
>>>>>>>> configuration
>>>>>>>> file)
>>>>>>>>
>>>>>>>> Do you have an error when creating your dictionary?
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> Mohamed wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I have run into some problems here.
>>>>>>>> I am trying to implement Arabic fulltext search on three columns.
>>>>>>>>
>>>>>>>> To create a dictionary I have a hunspell dictionary and an Arabic
>>>>>>>> stop file.
>>>>>>>>
>>>>>>>>  CREATE TEXT SEARCH DICTIONARY hunspell_dic (
>>>>>>>>   TEMPLATE = ispell,
>>>>>>>>   DictFile = hunarabic,
>>>>>>>>   AffFile = hunarabic,
>>>>>>>>   StopWords = arabic
>>>>>>>> );
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) The problem is that the hunspell package contains a .dic and a .aff
>>>>>>>> file, but the configuration requires a .dict and a .affix file. I have
>>>>>>>> tried to change the endings but with no success.
>>>>>>>>
>>>>>>>> 2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing
>>>>>>>>
>>>>>>>> 3) How can I convert my .dic and .aff to valid .dict and .affix ?
>>>>>>>>
>>>>>>>> 4) I have read that when using dictionaries, if a word is not
>>>>>>>> recognized by any dictionary it will not be indexed. I find that
>>>>>>>> troublesome. I would like everything but the stop words to be
>>>>>>>> indexed. I guess this might be a step that I am not ready for yet,
>>>>>>>> but I just wanted to put it out there.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Also I would like to know what the process of implementing fulltext
>>>>>>>> search looks like, from config to search.
>>>>>>>>
>>>>>>>> Create a dictionary, then a text search configuration, add the
>>>>>>>> dictionary to the configuration, index the columns with GIN or GiST ...
>>>>>>>>
>>>>>>>> What does a search look like? Does it match against the GIN/GiST
>>>>>>>> index? Has that index been built using the dictionary/configuration,
>>>>>>>> or is the dictionary only used on search phrases?
>>>>>>>>
>>>>>>>>  / Moe
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>        Regards,
>>>>>>               Oleg
>>>>>> _____________________________________________________________
>>>>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>>>>> Sternberg Astronomical Institute, Moscow University, Russia
>>>>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>>>>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
