Thread: BUG #6327: Prefix full-text-search fails for hosts with complicated names

BUG #6327: Prefix full-text-search fails for hosts with complicated names

From:
Marcin.Kasperski@mekk.waw.pl
Date:
The following bug has been logged on the website:

Bug reference:      6327
Logged by:          Marcin Kasperski
Email address:      Marcin.Kasperski@mekk.waw.pl
PostgreSQL version: 9.1.1
Operating system:   Linux
Description:

Synopsis
=========

'goog:*'  matches  google.com
but
'e-goog:*' does not match e-google.com

Example SQL
=============

Try the queries below. Note the ismatch column, which is t in the first
case and f in the second (IMHO it should be t in both).

SELECT a query, b message, a@@b ismatch FROM (
   SELECT TO_TSQUERY('english', 'goog:*') a,
          TO_TSVECTOR('english', 'See google.com') b) as foo;

SELECT a query, b message, a@@b ismatch FROM (
   SELECT TO_TSQUERY('english', 'e-goog:*') a,
          TO_TSVECTOR('english', 'See e-google.com') b) as foo;

Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

From:
Euler Taveira de Oliveira
Date:
On 05-12-2011 09:40, Marcin.Kasperski@mekk.waw.pl wrote:
> 'goog:*'  matches  google.com
> but
> 'e-goog:*' does not match e-google.com
>
It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.


--
   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

From:
Tom Lane
Date:
Marcin.Kasperski@mekk.waw.pl writes:
> Synopsis
> =========

> 'goog:*'  matches  google.com
> but
> 'e-goog:*' does not match e-google.com

The reason for this seems to be that the pattern is treated as a
hyphenated word:

regression=# select TO_TSQUERY('english', 'e-goog:*');
          to_tsquery
-------------------------------
 'e-goog':* & 'e':* & 'goog':*
(1 row)

but the hostname isn't:

regression=# select TO_TSVECTOR('english', 'See e-google.com');
       to_tsvector
--------------------------
 'e-google.com':2 'see':1
(1 row)

If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:

regression=# select TO_TSVECTOR('english', 'See e-google com');
                 to_tsvector
---------------------------------------------
 'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)

Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.

In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...

            regards, tom lane
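
One way to see the mismatch described above is ts_debug, which reports the token type the parser assigns to each piece of input. The following is a minimal sketch assuming the stock 'english' configuration; the choice of columns is only for readability.

```sql
-- Document side: how the default parser tokenizes the text containing the hostname.
SELECT alias, token, lexemes
  FROM ts_debug('english', 'See e-google.com');

-- Query side: the same prefix text, without the ":*" prefix marker.
SELECT alias, token, lexemes
  FROM ts_debug('english', 'e-goog');
```

If the first query reports e-google.com as a single token of a host type while the second reports hyphenated-word parts, that is the difference between the tsvector and tsquery outputs shown in this message.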

Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

From:
Euler Taveira de Oliveira
Date:
On 05-12-2011 12:29, Marcin Kasperski wrote:
>>> 'goog:*'  matches  google.com
>>> but 'e-goog:*' does not match e-google.com
>>>
>> It is a known limitation. The text search parser ignores some uncommon cases.
>> See TODO and archives.
>
> Could you suggest me what to look for? I don't see anything related on
> http://wiki.postgresql.org/wiki/Todo#Text_Search
> and I already tried numerous  searches to find similar problems, but
> failed to locate anything related…
>
The TODO item is "Improve handling of plus signs in email address user names,
and perhaps improve URL parsing".

Search for "url text search parser" in the archives.


--
   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

From:
Oleg Bartunov
Date:
On Mon, 5 Dec 2011, Tom Lane wrote:

> Marcin.Kasperski@mekk.waw.pl writes:
>> Synopsis
>> =========
>
>> 'goog:*'  matches  google.com
>> but
>> 'e-goog:*' does not match e-google.com
>
> The reason for this seems to be that the pattern is treated as a
> hyphenated word:
>
> regression=# select TO_TSQUERY('english', 'e-goog:*');
>          to_tsquery
> -------------------------------
> 'e-goog':* & 'e':* & 'goog':*
> (1 row)
>
> but the hostname isn't:
>
> regression=# select TO_TSVECTOR('english', 'See e-google.com');
>       to_tsvector
> --------------------------
> 'e-google.com':2 'see':1
> (1 row)
>
> If you change the text so it's not recognized as a hostname, you get
> lexemes that would match the query:
>
> regression=# select TO_TSVECTOR('english', 'See e-google com');
>                 to_tsvector
> ---------------------------------------------
> 'com':5 'e':3 'e-googl':2 'googl':4 'see':1
> (1 row)
>
> Possibly we could fix this by hacking the ts parser so that it would
> also apply the hyphenated-word rules to a hostname containing a dash.
>
> In general though, there are always going to be cases where prefix
> match doesn't work because of dictionary transformations ...

I'd index the 'after dictionary transformations' lexemes as well as the
original ones, to let prefix match always work.

>
>             regards, tom lane
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
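
Until the parser handles this case, a possible application-level workaround is to index a second copy of the text in which dots are replaced by spaces, so that the hyphenated-word rules apply to it. This is only a sketch and is not what Oleg proposes above (his suggestion concerns what text search itself indexes). The regular expression and the concatenation of two tsvectors are assumptions made for illustration, and replacing every dot will also change how other tokens, such as version numbers or file names, are parsed.

```sql
-- Combine the normal tsvector with one built from a dot-free copy of the text.
-- msg and foo are illustrative names, not part of the original report.
SELECT to_tsquery('english', 'e-goog:*') @@
       (   to_tsvector('english', msg)
        || to_tsvector('english', regexp_replace(msg, '[.]', ' ', 'g'))
       ) AS ismatch
  FROM (SELECT 'See e-google.com'::text AS msg) AS foo;
```

Given the lexemes Tom shows for 'See e-google com', ismatch should come out as t here, at the cost of a larger tsvector.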