Re: Bug with Tsearch and tsvector

Поиск

Список

Период

Сортировка

От	Kevin Grittner
Тема	Re: Bug with Tsearch and tsvector
Дата	26 апреля 2010 г. 17:55:17
Msg-id	4BD5B7500200002500030E22@gw.wicourts.gov обсуждение исходный текст
Ответ на	Re: Bug with Tsearch and tsvector (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Bug with Tsearch and tsvector
Список	pgsql-bugs

Дерево обсуждения

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at.  In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character to allow hexadecimal specification of
any disallowed character.  So, being the escape character itself, we
would need to allow it.

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form.  If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics.  Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too.  Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark.  We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space       = <US-ASCII coded character 20 hexadecimal>
| delims      = "<" | ">" | "#" | "%" | <">
| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that since % is the escape character, it is OK.

Hmm.  Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL.  Should we consider the # and anchor part of a URL?  Any other
questionable characters?

-Kevin

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Bug with Tsearch and tsvector