Thread: BUG #6327: Prefix full-text-search fails for hosts with complicated names
BUG #6327: Prefix full-text-search fails for hosts with complicated names
From
Marcin.Kasperski@mekk.waw.pl
Date:
The following bug has been logged on the website:

Bug reference:      6327
Logged by:          Marcin Kasperski
Email address:      Marcin.Kasperski@mekk.waw.pl
PostgreSQL version: 9.1.1
Operating system:   Linux
Description:

Synopsis
=========

'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com

Example SQL
=============

Try the queries below. Note the ismatch column, which is t in the former
and f in the latter case (IMHO it should be t in both).

SELECT a query, b message, a@@b ismatch
FROM (
    SELECT TO_TSQUERY('english', 'goog:*') a,
           TO_TSVECTOR('english', 'See google.com') b) as foo;

SELECT a query, b message, a@@b ismatch
FROM (
    SELECT TO_TSQUERY('english', 'e-goog:*') a,
           TO_TSVECTOR('english', 'See e-google.com') b) as foo;
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names
From
Euler Taveira de Oliveira
Date:
On 05-12-2011 09:40, Marcin.Kasperski@mekk.waw.pl wrote:
> 'goog:*' matches google.com
> but
> 'e-goog:*' does not match e-google.com
>
It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.

--
   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training

Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names
From
Tom Lane
Date:
Marcin.Kasperski@mekk.waw.pl writes:
> Synopsis
> =========
> 'goog:*' matches google.com
> but
> 'e-goog:*' does not match e-google.com

The reason for this seems to be that the pattern is treated as a
hyphenated word:

regression=# select TO_TSQUERY('english', 'e-goog:*');
          to_tsquery
-------------------------------
 'e-goog':* & 'e':* & 'goog':*
(1 row)

but the hostname isn't:

regression=# select TO_TSVECTOR('english', 'See e-google.com');
       to_tsvector
--------------------------
 'e-google.com':2 'see':1
(1 row)

If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:

regression=# select TO_TSVECTOR('english', 'See e-google com');
                 to_tsvector
---------------------------------------------
 'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)

Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.

In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...

			regards, tom lane
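The tokenization difference described above can be inspected directly with ts_debug(), which reports the token type the parser assigns to each piece of input. The following is only a sketch: it assumes the stock 'english' configuration and the default parser, and the last query is a workaround suggested by the output quoted above (replacing the dot so that the text is no longer recognized as a hostname), not a fix proposed in the thread.

```sql
-- How the default parser classifies the document text.
-- The hostname is expected to come back as a single token.
SELECT alias, token, lexemes
FROM ts_debug('english', 'See e-google.com');

-- How the parser classifies the search pattern.
-- A hyphenated word is expected to yield the whole word plus its parts.
SELECT alias, token, lexemes
FROM ts_debug('english', 'e-goog');

-- Workaround sketch: with the dot replaced by a space, the text is parsed
-- as a hyphenated word, so all three prefix terms have a lexeme to match.
SELECT to_tsquery('english', 'e-goog:*')
       @@ to_tsvector('english', replace('See e-google.com', '.', ' '))
       AS ismatch;
```

The replace() call changes what gets indexed, so a search for the full hostname token would no longer match that vector; whether that trade-off is acceptable depends on the application.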
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names
From
Euler Taveira de Oliveira
Date:
On 05-12-2011 12:29, Marcin Kasperski wrote:
>>> 'goog:*' matches google.com
>>> but 'e-goog:*' does not match e-google.com
>>>
>> It is a known limitation. The text search parser ignores some uncommon cases.
>> See TODO and archives.
>
> Could you suggest me what to look for? I don't see anything related on
>     http://wiki.postgresql.org/wiki/Todo#Text_Search
> and I already tried numerous searches to find similar problems, but
> failed to locate anything related
>
Improve handling of plus signs in email address user names, and perhaps
improve URL parsing

Search for "url text search parser" in the archives.

--
   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names
From
Oleg Bartunov
Date:
On Mon, 5 Dec 2011, Tom Lane wrote:

> Marcin.Kasperski@mekk.waw.pl writes:
>> Synopsis
>> =========
>
>> 'goog:*' matches google.com
>> but
>> 'e-goog:*' does not match e-google.com
>
> The reason for this seems to be that the pattern is treated as a
> hyphenated word:
>
> regression=# select TO_TSQUERY('english', 'e-goog:*');
>           to_tsquery
> -------------------------------
>  'e-goog':* & 'e':* & 'goog':*
> (1 row)
>
> but the hostname isn't:
>
> regression=# select TO_TSVECTOR('english', 'See e-google.com');
>        to_tsvector
> --------------------------
>  'e-google.com':2 'see':1
> (1 row)
>
> If you change the text so it's not recognized as a hostname, you get
> lexemes that would match the query:
>
> regression=# select TO_TSVECTOR('english', 'See e-google com');
>                  to_tsvector
> ---------------------------------------------
>  'com':5 'e':3 'e-googl':2 'googl':4 'see':1
> (1 row)
>
> Possibly we could fix this by hacking the ts parser so that it would
> also apply the hyphenated-word rules to a hostname containing a dash.
>
> In general though, there are always going to be cases where prefix
> match doesn't work because of dictionary transformations ...

I'd index the 'after dictionary transformations' lexemes as well as the
originals, so that prefix match always works.

>
> 			regards, tom lane
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
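One way to approximate the idea of indexing both the original and the dictionary-transformed lexemes at the application level is to concatenate two tsvectors, one built with the stemming 'english' configuration and one with the non-stemming 'simple' configuration. This is only a sketch under stated assumptions: the messages table and its body column are hypothetical, and the approach does not by itself change how hostnames are tokenized, since both configurations use the same default parser.

```sql
-- Hypothetical table used only for illustration.
CREATE TEMP TABLE messages (body text);
INSERT INTO messages VALUES
    ('See e-google.com'),
    ('Conditional indexing notes');

-- Stemmed ('english') and unstemmed ('simple') lexemes side by side.
SELECT body,
       to_tsvector('english', body) || to_tsvector('simple', body) AS both_forms
FROM messages;

-- Prefix search against the combined vector. The query is built with
-- 'simple' so that the prefix itself is not passed through the stemmer.
SELECT body
FROM messages
WHERE (to_tsvector('english', body) || to_tsvector('simple', body))
      @@ to_tsquery('simple', 'conditio:*');
```

The combined vector is larger than either part, so an index built on it costs more space; that is the price of keeping the unstemmed forms available for prefix matching.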