Thread: Clarification of the "simple" dictionary

Clarification of the "simple" dictionary

From
Andreas Joseph Krogh
Date:
Hi. It's not clear to me if the "simple" dictionary uses stopwords or
not, does it?
Can someone please post a complete description of what the "simple"
dict. does?

--
Andreas Joseph Krogh<andreak@officenet.no>
Senior Software Developer / CTO
------------------------+---------------------------------------------+
OfficeNet AS            | The most difficult thing in the world is to |
Rosenholmveien 25       | know how to do a thing and to watch         |
1414 Trollåsen          | somebody else doing it wrong, without       |
NORWAY                  | comment.                                    |
                         |                                             |
Tlf:    +47 24 15 38 90 |                                             |
Fax:    +47 24 15 38 91 |                                             |
Mobile: +47 909  56 963 |                                             |
------------------------+---------------------------------------------+


Re: Clarification of the "simple" dictionary

From
John Gage
Date:
The easiest way to look at this is to give the simple dictionary a
document with to_tsvector() and see if stopwords pop out.

In my experience they do.  In my experience, the simple dictionary
just breaks the document down into the space etc. separated words in
the document.  It doesn't analyze further.
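
A minimal sketch of that check, assuming a stock installation where the built-in 'simple' and 'english' configurations are unchanged (the sample sentence is arbitrary):

```sql
-- 'simple' keeps every word, lower-cased; note that "the" survives.
SELECT to_tsvector('simple', 'The Fat Rats');
-- 'fat':2 'rats':3 'the':1

-- 'english' drops the stop word "the" and stems "rats" to "rat".
SELECT to_tsvector('english', 'The Fat Rats');
-- 'fat':2 'rat':3
```

A single sentence does not prove that no stop word list is in effect, but it does show the difference between the two configurations at a glance.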

John


On Jul 22, 2010, at 4:15 PM, Andreas Joseph Krogh wrote:

> Hi. It's not clear to me if the "simple" dictionary uses stopwords
> or not, does it?
> Can someone please post a complete description of what the "simple"
> dict. does?


Re: Clarification of the "simple" dictionary

From
Andreas Joseph Krogh
Date:
On 07/22/2010 06:27 PM, John Gage wrote:
> The easiest way to look at this is to give the simple dictionary a
> document with to_tsvector() and see if stopwords pop out.
>
> In my experience they do.  In my experience, the simple dictionary
> just breaks the document down into the space etc. separated words in
> the document.  It doesn't analyze further.

That's my experience too, I just want to make sure it doesn't actually
have any stopwords which I've missed. Trying many phrases and checking
for stopwords isn't really proving anything.

Can anybody confirm the "simple" dict. only lowercases the words and
"uniques" them?



Re: Clarification of the "simple" dictionary

From
Oleg Bartunov
Date:
Don't guess, but read docs
http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to lower case and checking it against a file of
stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the
lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to
report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.

d=# \dFd+ simple
                                          List of text search dictionaries
   Schema   |  Name  |     Template      | Init options |                        Description
------------+--------+-------------------+--------------+-----------------------------------------------------------
 pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword

By default it has no Init options, so it doesn't check for stopwords.
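
As a rough illustration of the difference, the sketch below compares the built-in dictionary with one built from the same template plus a stop word list. The name public.simple_en_stop is made up for this example, and it assumes the standard english stop word file shipped with PostgreSQL is installed:

```sql
-- Built-in dictionary: no stop word file configured, so nothing is discarded.
SELECT ts_lexize('simple', 'The');                  -- {the}

-- Hypothetical dictionary using the same template with a stop word list.
CREATE TEXT SEARCH DICTIONARY public.simple_en_stop (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english
);
SELECT ts_lexize('public.simple_en_stop', 'The');   -- {}  (stop word, discarded)
SELECT ts_lexize('public.simple_en_stop', 'YeS');   -- {yes}

-- With Accept = false, non-stop-words are reported as unrecognized (NULL)
-- so that the next dictionary in the list gets to see them.
ALTER TEXT SEARCH DICTIONARY public.simple_en_stop (Accept = false);
SELECT ts_lexize('public.simple_en_stop', 'YeS');   -- NULL
```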


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Clarification of the "simple" dictionary

From
Andreas Joseph Krogh
Date:
On 07/22/2010 07:44 PM, Oleg Bartunov wrote:
> Don't guess, but read docs
> http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
>
>
> 12.6.2. Simple Dictionary
>
> The simple dictionary template operates by converting the input token
> to lower case and checking it against a file of stop words. If it is
> found in the file then an empty array is returned, causing the token
> to be discarded. If not, the lower-cased form of the word is returned
> as the normalized lexeme. Alternatively, the dictionary can be
> configured to report non-stop-words as unrecognized, allowing them to
> be passed on to the next dictionary in the list.
>
> d=# \dFd+ simple
>                                           List of text search dictionaries
>    Schema   |  Name  |     Template      | Init options |                        Description
> ------------+--------+-------------------+--------------+-----------------------------------------------------------
>  pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword
>
> By default it has no Init options, so it doesn't check for stopwords.

Guess what - I *have* read the docs, which say "...and checking it
against a file of stop words". What was unclear to me was whether it is
configured with a stop word file by default; from your reply I
understand that it is not. Very good, fits my needs like a glove :-)
Might it be worth updating the docs to make this clearer?

So - can we rely on "simple" to remain this way forever (no Init
options) or is it better to make a copy of it with the same properties
as today?

It seems "simple" + the unaccent dict. available in 9.0 saves my day,
thanks Mr. Bartunov.
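
One possible way to combine the two, as a sketch only: it assumes the unaccent contrib module is already installed in the database so that a dictionary named unaccent exists, and the configuration name public.simple_unaccent is invented for this example.

```sql
-- Start from the built-in 'simple' configuration (hypothetical new name).
CREATE TEXT SEARCH CONFIGURATION public.simple_unaccent (
    COPY = pg_catalog.simple
);

-- unaccent is a filtering dictionary: it strips accents and passes the
-- result on, so 'simple' then lower-cases it. Only token types that can
-- contain non-ASCII letters are remapped here.
ALTER TEXT SEARCH CONFIGURATION public.simple_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, simple;

-- Accented and unaccented spellings should now yield the same lexeme.
SELECT to_tsvector('public.simple_unaccent', 'Hôtel de la Mer');
```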



Re: Clarification of the "simple" dictionary

From
Oleg Bartunov
Date:
Andreas,

I'd create my own copy of the dictionary, to be independent of system changes.
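
Something along these lines would do it; the names public.my_simple and public.my_simple_cfg are only placeholders:

```sql
-- A private dictionary from the same template, with no stop word file,
-- mirroring how the built-in 'simple' dictionary is defined today.
CREATE TEXT SEARCH DICTIONARY public.my_simple (
    TEMPLATE = pg_catalog.simple
);

-- A private configuration copied from 'simple', then switched over to
-- the private dictionary for every token type that used the built-in one.
CREATE TEXT SEARCH CONFIGURATION public.my_simple_cfg (
    COPY = pg_catalog.simple
);
ALTER TEXT SEARCH CONFIGURATION public.my_simple_cfg
    ALTER MAPPING REPLACE simple WITH public.my_simple;

-- Use it explicitly rather than relying on default_text_search_config.
SELECT to_tsvector('public.my_simple_cfg', 'The Fat Rats');
```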

Oleg


Re: Clarification of the "simple" dictionary

From
John Gage
Date:
> By default it has no Init options, so it doesn't check for stopwords.

In the first place, this functionality is a rip-snorting home run on
Postgres.  I congratulate Oleg who I believe is one of the authors.

In the second, I too had not read (carefully) the documentation and am
very happy to find that I can eliminate stop words with 'simple'.
That will be a tremendous convenience going forward.

It turns out that using 'english' and getting stemmed lexemes is
extremely convenient too, but this functionality in 'simple' is
excellent.

Thanks,

John