Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

From: Bruce Momjian
Subject: Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores
Msg-id: 201003130308.o2D38U204192@momjian.us
In reply to: Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores  (Steve Atkins <steve@blighty.com>)
List: pgsql-hackers

Steve Atkins wrote:
>
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
>
> > Bruce Momjian <bruce@momjian.us> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> >> all valid characters:
> >
> >>    http://en.wikipedia.org/wiki/E-mail_address
> >
> >>    * Uppercase and lowercase English letters (a-z, A-Z)
> >>    * Digits 0 to 9
> >>    * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >>    * Character . (dot, period, full stop) provided that it is not the
> >>      first or last character, and provided also that it does not appear two
> >>      or more times consecutively.
> >
> > That's an awful lot of special characters.  For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address.  I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> >
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
>
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
>
> I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without being
> too prone to grab things that just look vaguely like email addresses. Obviously
> there are some things it'll match that aren't email addresses, and some
> email addresses it won't match, but for indexing it's been really pretty
> good when combined with a good regex for domain parts[1].
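
For anyone who wants to try the local-part pattern quoted above on their own text, here is a minimal Python sketch. It assumes the capture group is meant to wrap the whole local part, and that the `\"` and `\\s` in the quoted form are string-escaped versions of `"` and `\s`. The function name `find_local_parts` and the case-insensitive flag are illustrative assumptions, not part of the original regex, and no domain-part check is attempted here.

```python
import re

# Local-part pattern from the quoted message, rewritten as a Python raw string.
# Assumptions: the group wraps the whole local part, and matching is
# case-insensitive (the quoted regex shows no flags).
LOCAL_PART = re.compile(r'([a-z0-9_][^<"@\s]{0,80})@', re.IGNORECASE)


def find_local_parts(text):
    """Return every candidate local part that is directly followed by '@'."""
    return [m.group(1) for m in LOCAL_PART.finditer(text)]


if __name__ == "__main__":
    sample = 'write to a-b-_c@yahoo.com or "Steve" <steve+list@example.com>'
    for local in find_local_parts(sample):
        print(local)
```

Unlike the tsearch parser, this pattern accepts almost any character after the first one, so it would still need to be paired with a domain-part check before its matches are treated as addresses.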

OK, based on your experience, I think we have gone far enough by
allowing underscores.  I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate.  Look what happens when an underscore follows a dash:

    test=> select ts_parse('default', ' a-b_c@yahoo.com '   );
          ts_parse
    ---------------------
     (12," ")
     (4,a-b_c@yahoo.com)
     (12," ")
    (3 rows)

    test=> select ts_parse('default', ' a-b-_c@yahoo.com '   );
        ts_parse
    -----------------
     (12," ")
     (16,a-b)
     (11,a)
     (12,-)
     (11,b)
     (12,-_)
     (4,c@yahoo.com)
     (12," ")
    (8 rows)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml    14 Aug 2009 14:53:20 -0000    1.53
--- doc/src/sgml/textsearch.sgml    13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
      languages, token types <literal>word</> and <literal>asciiword</>
      should be treated alike.
     </para>
+
+    <para>
+     <literal>email</> does not support all valid email characters as
+     defined by RFC 5322.  Specifically, the only non-alphanumeric
+     characters supported for email user names are period, dash, and
+     underscore.
+    </para>
    </note>

    <para>
