Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores
| From | Steve Atkins |
|---|---|
| Subject | Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores |
| Date | |
| Msg-id | 207954FB-A0ED-4F85-99C6-2139893232A1@blighty.com |
| In response to | Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores |
| List | pgsql-hackers |

On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:

> Bruce Momjian <bruce@momjian.us> writes:
>> Well, I think the big question is whether we need to honor RFC 5322
>> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
>> all valid characters:
>
>> http://en.wikipedia.org/wiki/E-mail_address
>
>> * Uppercase and lowercase English letters (a-z, A-Z)
>> * Digits 0 to 9
>> * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
>> * Character . (dot, period, full stop) provided that it is not the
>> first or last character, and provided also that it does not appear two
>> or more times consecutively.
>
> That's an awful lot of special characters. For the RFC's purposes,
> it's not hard to be flexible because in an email message there is
> external context telling where to expect an address. I think if we
> tried to allow all of those in email addresses in tsearch, we'd have
> "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> benefit.
>
> I can see the case for adding "+" because that's fairly common as Alvaro
> notes, but I think we should be very circumspect about going farther.

I've been working with recognizing email addresses in text for years, with many millions of documents processed. Recognizing them in text is a very different problem to validating them or sanitizing them. Using the RFC spec to match things that "might be an email address" isn't a great idea in the wild, so +1 on the circumspect.

I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts of "real" email addresses in free text in the wild, without being too prone to grab things that just look vaguely like email addresses.

Obviously there are some things it'll match that aren't email addresses, and some email addresses it won't match, but for indexing it's been really pretty good when combined with a good regex for domain parts[1].

Cheers,
  Steve

[1] ([a-z0-9_][^<\"@\\s]{0,80})@([a-z0-9._-]{0,252}\\.(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|travel|jobs|mobi|tel|cat))

(Before you point out all the ways that differs from the RFC specs for an email address, yes, I know, but that's the point. Real world usage is not the same as RFC spec.) [2]

[2] This is the simplified version - the full version is marginally more selective, at the expense of being much more complex.
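
For illustration only, here is a minimal Python sketch of how the pattern in [1] might be applied to free text. It is not the code used for the processing described above. The names `EMAIL_RE` and `find_addresses` are hypothetical, the pattern is un-escaped from its string-literal form (`\"` becomes `"`, `\\s` becomes `\s`), and case-insensitive matching is an assumption. One further assumption is not part of the pattern as posted: a trailing negative lookahead. Without it, the `[a-z]{2}` alternative is tried first and a domain such as `example.com` would be captured as `example.co`.

```python
import re

# Local part as given in the message, un-escaped for a Python raw string.
LOCAL = r'([a-z0-9_][^<"@\s]{0,80})'

# Domain part from footnote [1], with the second group closed.
# ASSUMPTION (not in the original pattern): the trailing lookahead
# (?![a-z0-9_-]) stops the two-letter alternative from truncating
# longer TLDs such as "com" to "co".
DOMAIN = (
    r'([a-z0-9._-]{0,252}\.'
    r'(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero'
    r'|name|pro|travel|jobs|mobi|tel|cat)(?![a-z0-9_-]))'
)

# ASSUMPTION: case-insensitive matching, since the pattern lists only
# lowercase letters.
EMAIL_RE = re.compile(LOCAL + '@' + DOMAIN, re.IGNORECASE)


def find_addresses(text):
    """Return (local part, domain) pairs for address-like strings in text."""
    return [(m.group(1), m.group(2)) for m in EMAIL_RE.finditer(text)]


if __name__ == '__main__':
    sample = 'Contact john_doe@example.com or bob+list@mail.example.org today.'
    for local, domain in find_addresses(sample):
        print(local, domain)
```

As the message says, this is a heuristic for indexing rather than a validator: it accepts "+" and "_" in the local part, but it will also miss some valid addresses and match some strings that are not addresses.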