Re: Bug with Tsearch and tsvector

From Tom Lane
Subject Re: Bug with Tsearch and tsvector
Date
Msg-id 11841.1272309833@sss.pgh.pa.us
In response to Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug with Tsearch and tsvector  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-bugs
I wrote:
> "Donald Fraser" <postgres@kiwi-fraser.net> writes:
>> Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type.

> ts_debug shows that it's being parsed like this:

>       alias      |           description           |                 token                  |  dictionaries  |  dictionary  |                 lexemes
> -----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
>  tag             | XML tag                         | <span lang="EN-GB">                    | {}             |              |
>  protocol        | Protocol head                   | http://                                | {}             |              |
>  url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple       | {www.harewoodsolutions.co.uk/press.aspx}
>  host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple       | {www.harewoodsolutions.co.uk}
>  url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple       | {/press.aspx</span><span}
>  blank           | Space symbols                   |                                        | {}             |              |
>  asciiword       | Word, all ASCII                 | lang                                   | {english_stem} | english_stem | {lang}
>  ... etc ...

> ie the critical point seems to be that url_path is willing to soak up a
> string containing "<" and ">", so the span tags don't get recognized as
> separate lexemes.  While that's "obviously" the wrong thing in this
> particular example, I'm not sure if it's the wrong thing in general.
> Can anyone comment on the frequency of usage of those two symbols in
> URLs?

> In any case it's weird that the URL lexeme doesn't span the same text
> as the url_path one, but I'm not sure which one we should consider
> wrong.

I poked at this a bit.  The reason for the inconsistency between the url
and url_path lexemes is that the InURLPathStart state transitions
directly to InURLPath, which is *not* consistent with what happens while
parsing the URL as a whole: p_isURLPath() starts the sub-parser in
InFileFirst state.  The attached proposed patch rectifies that by
transitioning to InFileFirst state instead.  A possible objection to
this fix is that you may get either a "file" or a "url_path" component
lexeme, where before you always got "url_path".  I'm not sure if that's
something to worry about or not; I'd tend to think there's nothing much
wrong with it.

The other change in the attached patch is to make InURLPath parsing
stop at "<" or ">", as per discussion.
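
To make the second change concrete, here is a toy model in C of a
table-driven scan like the InURLPath one.  It is only a sketch, not the
code in wparser_def.c: ToyRow, ToyAction, toy_path_len and the predicate
helpers are made-up names, and the real parser's state stack, flags and
multibyte handling are left out.  What it tries to show is that rows are
tried in order, the first match wins, and putting "<" and ">" rows ahead
of the not-space row is what ends the token earlier.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the parser's action rows. */
typedef enum { ACT_NEXT, ACT_STOP } ToyAction;

typedef struct
{
    int         (*matches) (int c, int arg);    /* NULL marks the default row */
    int         arg;                            /* character to compare, if any */
    ToyAction   action;                         /* what to do when the row matches */
} ToyRow;

static int is_eof(int c, int arg) { (void) arg; return c == '\0'; }
static int is_eq(int c, int arg) { return c == arg; }
static int is_notspace(int c, int arg)
{
    (void) arg;
    return c != '\0' && !isspace((unsigned char) c);
}

/* Modeled on the pre-patch table: only EOF, quotes and whitespace end the path. */
static const ToyRow old_rows[] = {
    {is_eof, 0, ACT_STOP},
    {is_eq, '"', ACT_STOP},
    {is_eq, '\'', ACT_STOP},
    {is_notspace, 0, ACT_NEXT},
    {NULL, 0, ACT_STOP}
};

/* Modeled on the patched table: '<' and '>' also end the path. */
static const ToyRow new_rows[] = {
    {is_eof, 0, ACT_STOP},
    {is_eq, '"', ACT_STOP},
    {is_eq, '\'', ACT_STOP},
    {is_eq, '<', ACT_STOP},
    {is_eq, '>', ACT_STOP},
    {is_notspace, 0, ACT_NEXT},
    {NULL, 0, ACT_STOP}
};

/*
 * Return the length of the path token at the start of s.  For each
 * character the rows are tried in order; the first matching row (or the
 * default row with a NULL predicate) decides whether to continue or stop.
 */
static size_t
toy_path_len(const ToyRow *rows, const char *s)
{
    size_t      n = 0;

    for (;;)
    {
        const ToyRow *row = rows;
        int         c = (unsigned char) s[n];

        while (row->matches != NULL && !row->matches(c, row->arg))
            row++;
        if (row->action == ACT_STOP)
            return n;
        n++;
    }
}
```

With old_rows, toy_path_len applied to "/press.aspx</span>" covers the
whole string; with new_rows it stops just before the "<", leaving the
closing tag to be recognized separately.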

With these changes I get

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |    description    |                 token                  | dictionaries | dictionary |                 lexemes
----------+-------------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head     | http://                                | {}           |            |
 url      | URL               | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host              | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 file     | File or path name | /press.aspx                            | {simple}     | simple     | {/press.aspx}
 tag      | XML tag           | </span>                                | {}           |            |
(5 rows)

as compared to the prior behavior

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |  description  |                 token                  | dictionaries | dictionary |                 lexemes
----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head | http://                                | {}           |            |
 url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 url_path | URL path      | /press.aspx</span>                     | {simple}     | simple     | {/press.aspx</span>}
(4 rows)

Neither change affects the current set of regression tests; nonetheless
there's a potential compatibility issue here, so my thought is to
apply this only in HEAD.

Comments?

            regards, tom lane


Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    26 Apr 2010 19:17:48 -0000
***************
*** 1504,1521 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     {NULL, 0, A_NEXT, TPS_InURLPath, 0, NULL}
  };

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
--- 1504,1526 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     /* this should transition to same state that p_isURLPath starts in */
!     {NULL, 0, A_NEXT, TPS_InFileFirst, 0, NULL}
  };

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
