Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

From Steve Atkins
Subject Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores
Msg-id 207954FB-A0ED-4F85-99C6-2139893232A1@blighty.com
In reply to Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores
List pgsql-hackers
On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:

> Bruce Momjian <bruce@momjian.us> writes:
>> Well, I think the big question is whether we need to honor RFC 5322
>> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
>> all valid characters:
>
>>    http://en.wikipedia.org/wiki/E-mail_address
>
>>    * Uppercase and lowercase English letters (a-z, A-Z)
>>    * Digits 0 to 9
>>    * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
>>    * Character . (dot, period, full stop) provided that it is not the
>>      first or last character, and provided also that it does not appear two
>>      or more times consecutively.
>
> That's an awful lot of special characters.  For the RFC's purposes,
> it's not hard to be flexible because in an email message there is
> external context telling where to expect an address.  I think if we
> tried to allow all of those in email addresses in tsearch, we'd have
> "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> benefit.
>
> I can see the case for adding "+" because that's fairly common as Alvaro
> notes, but I think we should be very circumspect about going farther.

I've been working with recognizing email addresses in text for
years, with many millions of documents processed. Recognizing
them in text is a very different problem to validating them or sanitizing
them. Using the RFC spec to match things that "might be an email
address" isn't a great idea in the wild, so +1 on the circumspect.

I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
of "real" email addresses in free text in the wild, without being
too prone to grab things that just look vaguely like email addresses. Obviously
there are some things it'll match that aren't email addresses, and some
email addresses it won't match, but for indexing it's been really pretty
good when combined with a good regex for domain parts[1].

Cheers, Steve

[1]
([a-z0-9_][^<\"@\\s]{0,80})@([a-z0-9._-]{0,252}\\.(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|travel|jobs|mobi|tel|cat))

(Before you point out all the ways that differs from the RFC specs for
an email address, yes, I know, but that's the point. Real world usage
is not the same as RFC spec.) [2]

[2] This is the simplified version - the full version is marginally more
selective, at the expense of being much more complex.
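
As a rough illustration of how the pattern in [1] might be applied to free
text, here is a minimal Python sketch. It assumes the doubled backslashes in
[1] are string-literal escapes, so they are written singly in a raw string.
Case-insensitive matching, the trailing \b, and the helper name
find_addresses are assumptions of this sketch, not part of the pattern above;
the \b is there because the two-letter alternative is tried first and would
otherwise stop after "co" in "com".

```python
import re

# Pattern [1] from the message, unescaped for a Python raw string.
# Assumptions of this sketch (not in the original message):
#   - case-insensitive matching (re.IGNORECASE)
#   - a trailing \b so the two-letter alternative cannot end
#     in the middle of a longer TLD such as "com"
EMAIL_RE = re.compile(
    r'([a-z0-9_][^<"@\s]{0,80})@'
    r'([a-z0-9._-]{0,252}\.'
    r'(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name'
    r'|pro|travel|jobs|mobi|tel|cat))\b',
    re.IGNORECASE,
)


def find_addresses(text):
    """Return (local_part, domain) pairs for address-like strings in text."""
    return [(m.group(1), m.group(2)) for m in EMAIL_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Contact bob_smith+list@example.com or see <alice@example.org> today."
    for local_part, domain in find_addresses(sample):
        print(local_part, domain)
```

Because the local part only excludes <, ", @ and whitespace, it accepts "+" and
"_" without needing the full RFC 5322 character list, which is the trade-off
described above.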



