Re: Bug with Tsearch and tsvector

From Tom Lane
Subject Re: Bug with Tsearch and tsvector
Msg-id 22254.1272339139@sss.pgh.pa.us
In response to Re: Bug with Tsearch and tsvector  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Bug with Tsearch and tsvector  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-bugs

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Hmm.  Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL.  Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction.  The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example.  It only goes to the URLPath state if,
while in FilePath state, it sees '?'.  This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'.  This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead.  We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.
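
As a reading aid for the character rules in the attached patch, here is a minimal standalone sketch of what the patched TPS_InURLPathFirst and TPS_InURLPath tables appear to accept: a non-empty run of characters that stops at end of input, whitespace, a quote character, '<', or '>'. This is a toy model, not code from wparser_def.c. The names toy_ends_url_path and toy_url_path_len are hypothetical, and the sketch assumes single-byte ASCII input, whereas the real parser also handles multibyte text.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Toy model only (not PostgreSQL code).  Returns nonzero if c would end a
 * URL-path token under the patched rules: end of input, double or single
 * quote, '<', '>', or whitespace.
 */
static int
toy_ends_url_path(unsigned char c)
{
    return c == '\0' || c == '"' || c == '\'' || c == '<' || c == '>' ||
           isspace(c);
}

/*
 * Length of the URL-path token starting at s.  A result of 0 corresponds
 * to the A_POP rows in TPS_InURLPathFirst: no token starts here.
 * Assumes plain single-byte ASCII input.
 */
size_t
toy_url_path_len(const char *s)
{
    size_t n = 0;

    /* Corresponds to the p_isnotspace/A_NEXT row: keep consuming. */
    while (!toy_ends_url_path((unsigned char) s[n]))
        n++;
    return n;
}
```

Under this model, toy_url_path_len("/?") is 2, which is consistent with "/?" being reported as a url_path token in the patched ts_debug output shown in this message. Tightening the accepted set further would amount to adding more characters to toy_ends_url_path.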

I find that this patch does create a couple of changes in the regression
test outputs.  The reason is that it parses this case differently:
    select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

  alias   |  description  |       token        | dictionaries | dictionary |       lexemes
----------+---------------+--------------------+--------------+------------+----------------------
 protocol | Protocol head | http://            | {}           |            |
 host     | Host          | 5aew.werc.ewr:8100 | {simple}     | simple     | {5aew.werc.ewr:8100}
 blank    | Space symbols | /?                 | {}           |            |
(3 rows)

while with this patch we get

  alias   |  description  |        token         | dictionaries | dictionary |        lexemes
----------+---------------+----------------------+--------------+------------+------------------------
 protocol | Protocol head | http://              | {}           |            |
 url      | URL           | 5aew.werc.ewr:8100/? | {simple}     | simple     | {5aew.werc.ewr:8100/?}
 host     | Host          | 5aew.werc.ewr:8100   | {simple}     | simple     | {5aew.werc.ewr:8100}
 url_path | URL path      | /?                   | {simple}     | simple     | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

            regards, tom lane

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    27 Apr 2010 03:27:08 -0000
***************
*** 707,715 ****
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InFileFirst;

!     if (TParserGet(tmpprs) && (tmpprs->type == URLPATH || tmpprs->type == FILEPATH))
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
--- 707,715 ----
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InURLPathFirst;

!     if (TParserGet(tmpprs) && tmpprs->type == URLPATH)
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
***************
*** 1441,1447 ****
      {p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '.', A_NEXT, TPS_InPathFirst, 0, NULL},
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL}
  };
--- 1441,1446 ----
***************
*** 1488,1494 ****
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
  };

--- 1487,1492 ----
***************
*** 1504,1510 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

--- 1502,1510 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

***************
*** 1516,1521 ****
--- 1516,1523 ----
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
