Обсуждение: Text search parser's treatment of URLs and emails
Hi, I noticed that if I run this: SELECT alias, description, token FROM ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary'); I get: alias | description | token ----------+---------------+----------------------------------------------------------------- protocol | Protocol head | http:// url | URL | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary host | Host | www.postgresql.org:2345 url_path | URL path | /directory/page.html?version=9.1&build=alpha1#summary (4 rows) It could be me being picky, but I don't regard parameters or page fragments as part of the URL path. Ideally, I'd sort of expect: alias | description | token --------------+---------------+----------------------------------------------------------------- protocol | Protocol head | http:// url | URL | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary host | Host | www.postgresql.org port | Port | 2345 url_path | URL path | /directory/page.html query_string | Query string | version=9.1&build=alpha1 fragment | Page fragment | summary (7 rows) ... of course that's if there was support for query strings and page fragments, which there isn't. But if changes were made to support my definition of a URL path, they'd have to be considered breaking changes. But my main gripe is with the name "url_path". Also: SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com'); Yields: alias | description | token -----------+-----------------+-------------------- asciiword | Word, all ASCII | myname blank | Space symbols | + email | Email address | priority@gmail.com (3 rows) The entire string I entered is a valid email address, and isn't totally uncommon. Shouldn't that take such email address styles be taken into account? The example above incorrectly identifies the email address since the real destination address would most likely be myname@gmail.com. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On 8 September 2010 21:48, Thom Brown <thom@linux.com> wrote: > Hi, > > I noticed that if I run this: > > SELECT alias, description, token FROM > ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary'); > > I get: > > alias | description | token > ----------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org:2345 > url_path | URL path | > /directory/page.html?version=9.1&build=alpha1#summary > (4 rows) > > > It could be me being picky, but I don't regard parameters or page > fragments as part of the URL path. Ideally, I'd sort of expect: > > alias | description | token > --------------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org > port | Port | 2345 > url_path | URL path | /directory/page.html > query_string | Query string | version=9.1&build=alpha1 > fragment | Page fragment | summary > (7 rows) > > ... of course that's if there was support for query strings and page > fragments, which there isn't. But if changes were made to support my > definition of a URL path, they'd have to be considered breaking > changes. > > But my main gripe is with the name "url_path". > > Also: > > SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com'); > > Yields: > > alias | description | token > -----------+-----------------+-------------------- > asciiword | Word, all ASCII | myname > blank | Space symbols | + > email | Email address | priority@gmail.com > (3 rows) > > The entire string I entered is a valid email address, and isn't > totally uncommon. Shouldn't that take such email address styles be > taken into account? The example above incorrectly identifies the > email address since the real destination address would most likely be > myname@gmail.com. No opinions? -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
Thom Brown wrote: > Hi, > > I noticed that if I run this: > > SELECT alias, description, token FROM > ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary'); > > I get: > > alias | description | token > ----------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org:2345 > url_path | URL path | > /directory/page.html?version=9.1&build=alpha1#summary > (4 rows) > > > It could be me being picky, but I don't regard parameters or page > fragments as part of the URL path. Ideally, I'd sort of expect: > > alias | description | token > --------------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org > port | Port | 2345 > url_path | URL path | /directory/page.html > query_string | Query string | version=9.1&build=alpha1 > fragment | Page fragment | summary > (7 rows) > > ... of course that's if there was support for query strings and page > fragments, which there isn't. But if changes were made to support my > definition of a URL path, they'd have to be considered breaking > changes. > Wow, that is a tough one. One the one hand, it seems nice to be able to split stuff out more, but on the other hand we would be making url_path less useful because people would need to piece things together to get the old behavior. In fact to piece things together we would need to add '?' and '#' optionally, which seems kind of hard. Perhaps we should keep url_path unchanged and add file_path that has your suggestion. That would allow more fine-grained control without breaking backward compatibility. We already duplicate some data with url and url_path, so having file_path as another place we duplicate some seems OK. > But my main gripe is with the name "url_path". > > Also: > > SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com'); > > Yields: > > alias | description | token > -----------+-----------------+-------------------- > asciiword | Word, all ASCII | myname > blank | Space symbols | + > email | Email address | priority@gmail.com > (3 rows) > > The entire string I entered is a valid email address, and isn't > totally uncommon. Shouldn't that take such email address styles be > taken into account? The example above incorrectly identifies the > email address since the real destination address would most likely be > myname@gmail.com. I had no idea '+' could be part of an email address, and in fact it is a modifier that is stripped off when delivering the email: http://my.brandeis.edu/bboard/q-and-a-fetch-msg?msg_id=0000Nu I didn't even know that was possible. It is used as an email delivery flag. I agree that needs to be corrected. We fixed URLs in 9.0 with: Use more standards-compliant rules for parsing URL tokens (Tom Lane) so I think it is reasonable to fix email addresses in 9.1. Care to submit a patch? You can lookup Tom's change as a guide. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
I have added this as a TODO: * Improve handling of plus signs in email address user names, and perhaps improve URL parsing * http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php --------------------------------------------------------------------------- Thom Brown wrote: > Hi, > > I noticed that if I run this: > > SELECT alias, description, token FROM > ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary'); > > I get: > > alias | description | token > ----------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org:2345 > url_path | URL path | > /directory/page.html?version=9.1&build=alpha1#summary > (4 rows) > > > It could be me being picky, but I don't regard parameters or page > fragments as part of the URL path. Ideally, I'd sort of expect: > > alias | description | token > --------------+---------------+----------------------------------------------------------------- > protocol | Protocol head | http:// > url | URL | > www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary > host | Host | www.postgresql.org > port | Port | 2345 > url_path | URL path | /directory/page.html > query_string | Query string | version=9.1&build=alpha1 > fragment | Page fragment | summary > (7 rows) > > ... of course that's if there was support for query strings and page > fragments, which there isn't. But if changes were made to support my > definition of a URL path, they'd have to be considered breaking > changes. > > But my main gripe is with the name "url_path". > > Also: > > SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com'); > > Yields: > > alias | description | token > -----------+-----------------+-------------------- > asciiword | Word, all ASCII | myname > blank | Space symbols | + > email | Email address | priority@gmail.com > (3 rows) > > The entire string I entered is a valid email address, and isn't > totally uncommon. Shouldn't that take such email address styles be > taken into account? The example above incorrectly identifies the > email address since the real destination address would most likely be > myname@gmail.com. > > -- > Thom Brown > Twitter: @darkixion > IRC (freenode): dark_ixion > Registered Linux user: #516935 > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +