Re: [GENERAL] Text search parser's treatment of URLs and emails

Поиск
Список
Период
Сортировка
От Bruce Momjian
Тема Re: [GENERAL] Text search parser's treatment of URLs and emails
Дата
Msg-id 201010122203.o9CM3RW09263@momjian.us
обсуждение исходный текст
Ответы Re: Re: [GENERAL] Text search parser's treatment of URLs and emails  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [GENERAL] Text search parser's treatment of URLs and emails  (Christian Ullrich <chris@chrullrich.net>)
Список pgsql-hackers
[ sent to hackers where it belongs ]

Thom Brown wrote:
> Hi,
> 
> I noticed that if I run this:
> 
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
> 
> I get:
> 
>   alias   |  description  |                              token
> ----------+---------------+-----------------------------------------------------------------
>  protocol | Protocol head | http://
>  url      | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host     | Host          | www.postgresql.org:2345
>  url_path | URL path      |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
> 
> 
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path.  Ideally, I'd sort of expect:
> 
>     alias     |  description  |                              token
> --------------+---------------+-----------------------------------------------------------------
>  protocol     | Protocol head | http://
>  url          | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host         | Host          | www.postgresql.org
>  port         | Port          | 2345
>  url_path     | URL path      | /directory/page.html
>  query_string | Query string  | version=9.1&build=alpha1
>  fragment     | Page fragment | summary
> (7 rows)
> 
> ... of course that's if there was support for query strings and page
> fragments, which there isn't.  But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
> 

Wow, that is a tough one.  One the one hand, it seems nice to be able to
split stuff out more, but on the other hand we would be making url_path
less useful because people would need to piece things together to get
the old behavior.  In fact to piece things together we would need to add
'?' and '#' optionally, which seems kind of hard.  Perhaps we should
keep url_path unchanged and add file_path that has your suggestion. 
That would allow more fine-grained control without breaking backward
compatibility.  We already duplicate some data with url and url_path, so
having file_path as another place we duplicate some seems OK.

> But my main gripe is with the name "url_path".
> 
> Also:
> 
> SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');
> 
> Yields:
> 
>    alias   |   description   |       token
> -----------+-----------------+--------------------
>  asciiword | Word, all ASCII | myname
>  blank     | Space symbols   | +
>  email     | Email address   | priority@gmail.com
> (3 rows)
> 
> The entire string I entered is a valid email address, and isn't
> totally uncommon.  Shouldn't that take such email address styles be
> taken into account?  The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname@gmail.com.

I had no idea '+' could be part of an email address, and in fact it is a
modifier that is stripped off when delivering the email:
http://my.brandeis.edu/bboard/q-and-a-fetch-msg?msg_id=0000Nu

I didn't even know that was possible.  It is used as an email delivery
flag.  I agree that needs to be corrected.  We fixed URLs in 9.0 with:
      Use more standards-compliant rules for parsing URL tokens      (Tom Lane)

so I think it is reasonable to fix email addresses in 9.1.  Care to
submit a patch?   You can lookup Tom's change as a guide.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Josh Berkus
Дата:
Сообщение: Re: SQL command to edit postgresql.conf, with comments
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Review: Fix snapshot taking inconsistencies