Re: Bug with Tsearch and tsvector

Поиск
Список
Период
Сортировка
От Kevin Grittner
Тема Re: Bug with Tsearch and tsvector
Дата
Msg-id 4BD5B7500200002500030E22@gw.wicourts.gov
обсуждение исходный текст
Ответ на Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at.  In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character to allow hexadecimal specification of
any disallowed character.  So, being the escape character itself, we
would need to allow it.

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form.  If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics.  Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too.  Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark.  We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space       = <US-ASCII coded character 20 hexadecimal>
| delims      = "<" | ">" | "#" | "%" | <">
| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that since % is the escape character, it is OK.

Hmm.  Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL.  Should we consider the # and anchor part of a URL?  Any other
questionable characters?

-Kevin

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Bug with Tsearch and tsvector
Следующее
От: "Kevin Grittner"
Дата:
Сообщение: Re: Bug with Tsearch and tsvector