Обсуждение: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

Поиск
Список
Период
Сортировка

Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Oleg, Teodor, can you look at this?  I tried to fix it in wparser_def.c,
but couldn't figure out how.  Thanks.

---------------------------------------------------------------------------

Dan O'Hara wrote:
> 
> The following bug has been logged online:
> 
> Bug reference:      5021
> Logged by:          Dan O'Hara
> Email address:      danarasoftware@gmail.com
> PostgreSQL version: 8.3.7
> Operating system:   win32
> Description:        ts_parse doesn't recognize email addresses with
> underscores
> Details: 
> 
> In the following example, 
> 
> select distinct token as email 
> from ts_parse('default', ' first_last@yahoo.com '   )
> where tokid = 4
> 
> ts_parse returns last@yahoo.com rather than first_last@yahoo.com  It seems
> that any text prior to the underscore is truncated.  If the portion
> following the underscore is only numeric, such as this example,
> 
> select distinct token as email 
> from ts_parse('default', ' bill_2000@yahoo.com '   )
> where tokid = 4
> 
> then ts_parse returns nothing at all.
> 
> section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
> an email address.
> 
> http://tools.ietf.org/html/rfc5322
> 
> -- 
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.comPG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard
drive,Christ can be your backup. +
 


Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Teodor Sigaev
Дата:
> Oleg, Teodor, can you look at this?  I tried to fix it in wparser_def.c,
> but couldn't figure out how.  Thanks.
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com '   )
>> where tokid = 4

Patch in attachment, it allows underscore in the middle of local part of email
in in host name (similarly to '-' character).


I'm not sure about backpatching, because it could break existing search
configuration.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения

Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Teodor Sigaev wrote:
> > Oleg, Teodor, can you look at this?  I tried to fix it in wparser_def.c,
> > but couldn't figure out how.  Thanks.
> >>
> >> select distinct token as email
> >> from ts_parse('default', ' first_last@yahoo.com '   )
> >> where tokid = 4
> 
> Patch in attachment, it allows underscore in the middle of local part of email 
> in in host name (similarly to '-' character).

Thanks, patch applied.

> I'm not sure about backpatching, because it could break existing search 
> configuration.

Agreed.  I don't think this warrants backpatching.

Here is the before behavior:
test=> select ts_parse('default', ' first_last@yahoo.com '   );      ts_parse-------------------- (12," ") (1,first)
(12,_)
-->     (4,last@yahoo.com) (12," ")(5 rows)

and the after-patch, fixed behavior:
test=> select ts_parse('default', ' first_last@yahoo.com '   );         ts_parse-------------------------- (12," ")
-->     (4,first_last@yahoo.com) (12," ")(3 rows)

I assume because this only expands the pattern space for email addresses
that there is no affect on binary upgrades with this patch.  Is that
correct?  Would an email address check on a binary-upgraded tsvector
index not match an email address with underscores?  Do we need a warning
in the release notes about this?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Alvaro Herrera
Дата:
Upon seeing this patch I considered that I use addresses such as
alvherre+stuff@something.org  and wondered how could this thing support
that.  I don't think we want extra parser stuff just to add whatever
random junk we want to support in email addresses ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Alvaro Herrera wrote:
> 
> Upon seeing this patch I considered that I use addresses such as
> alvherre+stuff@something.org  and wondered how could this thing support
> that.  I don't think we want extra parser stuff just to add whatever
> random junk we want to support in email addresses ...

Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:
   http://en.wikipedia.org/wiki/E-mail_address
   * Uppercase and lowercase English letters (a-z, A-Z)   * Digits 0 to 9   * Characters ! # $ % & ' * + - / = ? ^ _ `
{| } ~   * Character . (dot, period, full stop) provided that it is not the     first or last character, and provided
alsothat it does not appear two     or more times consecutively.
 

And we don't currently honor most of the special characters, including
plus:
test=> select ts_parse('default', ' first+last@yahoo.com '   );      ts_parse-------------------- (12," ") (1,first)
(12,+)(4,last@yahoo.com) (12," ")(5 rows)
 

Where does this leave us?  Do we add the other characters?  Do we
document that we only allow a limited number of characters for email
addresses?  What is the logic in that?  Do any of these characters
conflict with our tsquery operators?  

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Well, I think the big question is whether we need to honor RFC 5322
> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> all valid characters:

>     http://en.wikipedia.org/wiki/E-mail_address

>     * Uppercase and lowercase English letters (a-z, A-Z)
>     * Digits 0 to 9
>     * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
>     * Character . (dot, period, full stop) provided that it is not the
>       first or last character, and provided also that it does not appear two
>       or more times consecutively.

That's an awful lot of special characters.  For the RFC's purposes,
it's not hard to be flexible because in an email message there is
external context telling where to expect an address.  I think if we
tried to allow all of those in email addresses in tsearch, we'd have
"email addresses" gobbling up a whole lot of adjacent text, to nobody's
benefit.

I can see the case for adding "+" because that's fairly common as Alvaro
notes, but I think we should be very circumspect about going farther.
        regards, tom lane


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Well, I think the big question is whether we need to honor RFC 5322
> > (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> > all valid characters:
> 
> >     http://en.wikipedia.org/wiki/E-mail_address
> 
> >     * Uppercase and lowercase English letters (a-z, A-Z)
> >     * Digits 0 to 9
> >     * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >     * Character . (dot, period, full stop) provided that it is not the
> >       first or last character, and provided also that it does not appear two
> >       or more times consecutively.
> 
> That's an awful lot of special characters.  For the RFC's purposes,
> it's not hard to be flexible because in an email message there is
> external context telling where to expect an address.  I think if we
> tried to allow all of those in email addresses in tsearch, we'd have
> "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> benefit.
> 
> I can see the case for adding "+" because that's fairly common as Alvaro
> notes, but I think we should be very circumspect about going farther.

OK, I can add '+' using Teodor's patch as a guide, and document which
characters we support, and that we don't support all of them.  What
about the binary upgrade issue?  I am now worried that maybe we should
back out the patch and just document our restrictions.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Steve Atkins
Дата:
On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:

> Bruce Momjian <bruce@momjian.us> writes:
>> Well, I think the big question is whether we need to honor RFC 5322
>> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
>> all valid characters:
>
>>    http://en.wikipedia.org/wiki/E-mail_address
>
>>    * Uppercase and lowercase English letters (a-z, A-Z)
>>    * Digits 0 to 9
>>    * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
>>    * Character . (dot, period, full stop) provided that it is not the
>>      first or last character, and provided also that it does not appear two
>>      or more times consecutively.
>
> That's an awful lot of special characters.  For the RFC's purposes,
> it's not hard to be flexible because in an email message there is
> external context telling where to expect an address.  I think if we
> tried to allow all of those in email addresses in tsearch, we'd have
> "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> benefit.
>
> I can see the case for adding "+" because that's fairly common as Alvaro
> notes, but I think we should be very circumspect about going farther.

I've been working with recognizing email addresses in text for
years, with many millions of documents processed. Recognizing
them in text is a very different problem to validating them or sanitizing
them. Using the RFC spec to match things that "might be an email
address" isn't a great idea in the wild, so +1 on the circumspect.

I've found that /[a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
of "real" email addresses in free text in the wild, without getting being
too prone to grab things that just look vaguely like email addresses. Obviously
there are some things it'll match that aren't email addresses, and some
email addresses it won't match, but for indexing it's been really pretty
good when combined with a good regex for domain parts[1].

Cheers, Steve

[1]
([a-z0-9_][^<\"@\\s]{0,80})@([a-z0-9._-]{0,252}\\.(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|travel|jobs|mobi|tel|cat)

(Before you point out all the ways that differs from the RFC specs for
an email address, yes, I know, but that's the point. Real world usage
is not the same as RFC spec.) [2]

[2] This is the simplified version - the full version is marginally more
selective, at the expense of being much more complex.




Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> OK, I can add '+' using Teodor's patch as a guide, and document which
> characters we support, and that we don't support all of them.  What
> about the binary upgrade issue?  I am now worried that maybe we should
> back out the patch and just document our restrictions.

The tsearch stuff is supposedly designed to be flexible about that.
It's not really any different from changing your tsearch configuration
midstream.  You might not get matches that you would have liked to get,
but there's not any internal inconsistency.
        regards, tom lane


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > OK, I can add '+' using Teodor's patch as a guide, and document which
> > characters we support, and that we don't support all of them.  What
> > about the binary upgrade issue?  I am now worried that maybe we should
> > back out the patch and just document our restrictions.
> 
> The tsearch stuff is supposedly designed to be flexible about that.
> It's not really any different from changing your tsearch configuration
> midstream.  You might not get matches that you would have liked to get,
> but there's not any internal inconsistency.

Glad you are not worried.  ;-)

What concerns me is if someone created a tsvector index or stored
tsvector output in a table using the old rules, a new query might not
match the binary-upgraded tsvector stored values.

You are right that internally it is fine.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do


Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

От
Bruce Momjian
Дата:
Steve Atkins wrote:
>
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
>
> > Bruce Momjian <bruce@momjian.us> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> >> all valid characters:
> >
> >>    http://en.wikipedia.org/wiki/E-mail_address
> >
> >>    * Uppercase and lowercase English letters (a-z, A-Z)
> >>    * Digits 0 to 9
> >>    * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >>    * Character . (dot, period, full stop) provided that it is not the
> >>      first or last character, and provided also that it does not appear two
> >>      or more times consecutively.
> >
> > That's an awful lot of special characters.  For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address.  I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> >
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
>
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
>
> I've found that /[a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without getting being
> too prone to grab things that just look vaguely like email addresses. Obviously
> there are some things it'll match that aren't email addresses, and some
> email addresses it won't match, but for indexing it's been really pretty
> good when combined with a good regex for domain parts[1].

OK, based on your experience, I think we have gone far enough by
allowing underscores.  I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate.  Look what happens when a dash follows an underscore:

    test=> select ts_parse('default', ' a-b_c@yahoo.com '   );
          ts_parse
    ---------------------
     (12," ")
     (4,a-b_c@yahoo.com)
     (12," ")
    (3 rows)

    test=> select ts_parse('default', ' a-b-_c@yahoo.com '   );
        ts_parse
    -----------------
     (12," ")
     (16,a-b)
     (11,a)
     (12,-)
     (11,b)
     (12,-_)
     (4,c@yahoo.com)
     (12," ")
    (8 rows)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml    14 Aug 2009 14:53:20 -0000    1.53
--- doc/src/sgml/textsearch.sgml    13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
      languages, token types <literal>word</> and <literal>asciiword</>
      should be treated alike.
     </para>
+
+    <para>
+     <literal>email</> does not support all valid email characters as
+     defined by RFC 5322.  Specifically, the only non-alphanumeric
+     characters supported for email user names are period, dash, and
+     underscore.
+    </para>
    </note>

    <para>