Обсуждение: Select all invalid e-mail addresses
I have a database of e-mail addresses. I want to select the email addresses which are not valid: do not contain exactly one @ character, contain ; > < " ' , characters or spaces etc. What is the WHERE clause for this ? Andrus.
This might be handy: http://www.databasejournal.com/img/email_val.sql > -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Andrus > Sent: Wednesday, October 19, 2005 11:12 AM > To: pgsql-general@postgresql.org > Subject: [GENERAL] Select all invalid e-mail addresses > > I have a database of e-mail addresses. > > I want to select the email addresses which are not valid: > > do not contain exactly one @ character, > contain ; > < " ' , characters or spaces etc. > > What is the WHERE clause for this ? > > Andrus. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend
Andrus wrote: > I have a database of e-mail addresses. > > I want to select the email addresses which are not valid: > > do not contain exactly one @ character, > contain ; > < " ' , characters or spaces etc. > > What is the WHERE clause for this ? Please see a long, detailed thread in the archives titled "Email Verification Regular Expression" on Sept 7, 2005. -- Guy Rouillier
On Wed, Oct 19, 2005 at 09:12:16PM +0300, Andrus wrote: > I want to select the email addresses which are not valid: > > do not contain exactly one @ character, > contain ; > < " ' , characters or spaces etc. The rules that define a valid email address are more complex than most people realize, and even if an address is syntactically valid that doesn't mean it's valid in the sense that you can deliver mail to it. Whatever method you end up using, be sure to understand its limitations. One possibility would be to write a plperlu function that uses the Email::Valid module. Here's a trivial example; see the Email::Valid documentation to learn about its full capabilities: CREATE FUNCTION is_valid_email(text) RETURNS boolean AS $$ use Email::Valid; return Email::Valid->address($_[0]) ? "true" : "false"; $$ LANGUAGE plperlu IMMUTABLE STRICT; You could then do something like: SELECT * FROM foo WHERE NOT is_valid_email(email_address); Again, be aware that passing this or any other test doesn't necessarily mean that an address is truly valid -- it's just an attempt to identify addresses that are obviously bogus. -- Michael Fuhr
On 19.10.2005 21:18, Michael Fuhr wrote: > One possibility would be to write a plperlu function that uses the > Email::Valid module. Here's a trivial example; see the Email::Valid > documentation to learn about its full capabilities: ..and if you don't mind installing pl/php, you can use this function: http://hannes.imos.net/validmail.html It performs a MX-lookup, which IMHO is the best way to check for validity. -- Regards, Hannes Dorbath
"Andrus" <eetasoft@online.ee> writes: > I have a database of e-mail addresses. > > I want to select the email addresses which are not valid: > > do not contain exactly one @ character, > contain ; > < " ' , characters or spaces etc. > > What is the WHERE clause for this ? > There was a thread here not so long ago about matching valid email addresses. It's not so simple. You probably want to do a regex match - e.g. select ... where email ~ '<regex>' However the regex for all valid email possibilities is *VERY* complex. see: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html You should probably search the maillist archives. ISTR that there were some suggestions on how one might simplify the search space.
Hannes Dorbath <light@theendofthetunnel.de> writes: > On 19.10.2005 21:18, Michael Fuhr wrote: >> One possibility would be to write a plperlu function that uses the >> Email::Valid module. Here's a trivial example; see the Email::Valid >> documentation to learn about its full capabilities: > > ..and if you don't mind installing pl/php, you can use this function: > > http://hannes.imos.net/validmail.html > > It performs a MX-lookup, which IMHO is the best way to check for validity. But that's expensive and slow, and doesn't tell you whether the user part of the address is valid (and in general, there's no way to determine that short of actually sending a message). So what's the point? -Doug
On 20.10.2005 14:00, Douglas McNaught wrote: > But that's expensive and slow Sure, that isn't meant to be used in a WHERE condition on a 100k row table.. more to be bound via check constraint on a user table, so incomming data is validated. > and doesn't tell you whether the user part of the address is valid (and in general, there's no way to > determine that short of actually sending a message). So what's the > point? The point is to validate as good as possible, and as you said yourself, the user part can't be validated further. -- Regards, Hannes Dorbath
""Guy Rouillier"" <guyr@masergy.com> wrote in message news:CC1CF380F4D70844B01D45982E671B239E8BE9@mtxexch01.add0.masergy.com... > Andrus wrote: >> I have a database of e-mail addresses. >> >> I want to select the email addresses which are not valid: >> >> do not contain exactly one @ character, >> contain ; > < " ' , characters or spaces etc. >> >> What is the WHERE clause for this ? > > Please see a long, detailed thread in the archives titled "Email > Verification Regular Expression" on Sept 7, 2005. Guy Rouillier, thank you. I have emails in CHARACTER(60) type columns in database, total 3000 emails. I need to check email addresses for most frequent typos before send. I have only plpgsql language installed, no perl. From this thread I got the regular expression /^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/How Ican use this in where clause ? I havent never used regular expressionsin Postgres.How to exclude top-level domain namesfrom this regex ?Andrus.
On 10/20/05, Douglas McNaught <doug@mcnaught.org> wrote:
That's why I think the better term for this is "well formed". "Validity" can only be determined by sending to it, but you can tell if an address at least conforms to the specs programmatically. In the end, when talking about a "valid" address in this context, that is what most people are talking about.
The point is to weed out malformed email addresses, just like you would enforce any other data formatting standards in other types of data.
> It performs a MX-lookup, which IMHO is the best way to check for validity.
But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?
-Doug
That's why I think the better term for this is "well formed". "Validity" can only be determined by sending to it, but you can tell if an address at least conforms to the specs programmatically. In the end, when talking about a "valid" address in this context, that is what most people are talking about.
The point is to weed out malformed email addresses, just like you would enforce any other data formatting standards in other types of data.
On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote: > >From this thread I got the regular expression [snipped] Note that that regular expression, which appears to be validating TLDs as well, is incredibly fragile. John Klensin has actually written an RFC about this very problem. Among other problems, what do you do when a country code ceases to be? (There's a similar problem that the naming bodies struggke with from time to time.) I suggest that if you want to validate TLDs, you pull them off when you write the data in your database, and use a lookup table to make sure they're valid (you can keep the table up to date regularly by checking the official IANA registry for them). At least that way you don't have to change a regex every time ICANN decides to add another TLD. (The regex is wrong anyway, I think: it doesn't have .mobi, which has been announced although isn't taking registrations yet, and it doesn't appear to have arpa, either.) A -- Andrew Sullivan | ajs@crankycanuck.ca Information security isn't a technological problem. It's an economics problem. --Bruce Schneier
On Thu, Oct 20, 2005 at 11:22:25AM -0400, Brian Mathis wrote: > That's why I think the better term for this is "well formed". "Validity" can > only be determined by sending to it, but you can tell if an address at least In fact, it can only be determined by sending to it over and over again, because whether a mail address works may change over time (and may have nothing to do with the poor schmuck whose email administrators don't know how to spell "MX record"). A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message > I suggest that if you want to validate TLDs, you pull them off when > you write the data in your database, and use a lookup table to make > sure they're valid (you can keep the table up to date regularly by > checking the official IANA registry for them). At least that way you > don't have to change a regex every time ICANN decides to add another > TLD. (The regex is wrong anyway, I think: it doesn't have .mobi, > which has been announced although isn't taking registrations yet, and > it doesn't appear to have arpa, either.) Andrew, thank you. I understand now that I do'nt want to validate TLDs at all. I have an existing database of e-mail addresses. Those addesses are copied from letters so they contain < > chars, points, commas etc. stupid characters. Sometimes two email addresses are copied to this field (contains two @ sings, spaces or commas). Sometimes web addresses starting with www. and without @ are present in email column. I want simply to allow user to view those addresses and make manual corrections before starting large mailing session in night. How to write a WHERE clause which selects e-mail addresses which are surely wrong ? Andrus. > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match >
On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote: > On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote: > > >From this thread I got the regular expression > > [snipped] > > Note that that regular expression, which appears to be validating > TLDs as well, is incredibly fragile. John Klensin has actually > written an RFC about this very problem. Among other problems, what > do you do when a country code ceases to be? (There's a similar > problem that the naming bodies struggke with from time to time.) > > I suggest that if you want to validate TLDs, you pull them off when > you write the data in your database, and use a lookup table to make > sure they're valid (you can keep the table up to date regularly by > checking the official IANA registry for them). At least that way you > don't have to change a regex every time ICANN decides to add another > TLD. You need to maintain the data, certainly. To argue that it must be in a table to be maintained is, well, wrong. My preference would be to keep it in a table and regenerate the regex periodically, and in the application layer I do exactly that, but to try and do that in a check constraint would be painful. A cleaner approach would be to have a regex that checks for general syntax and extracts the TLD, which is then compared to a lookup table, perhaps, but that adds a lot of complexity for no real benefit. > (The regex is wrong anyway, I think: it doesn't have .mobi, > which has been announced although isn't taking registrations yet, and > it doesn't appear to have arpa, either.) While there are valid deliverable email addresses in .arpa, you really don't want to be accepting them from end users... Cheers, Steve
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote: > "Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message > > > I suggest that if you want to validate TLDs, you pull them off when > > you write the data in your database, and use a lookup table to make > > sure they're valid (you can keep the table up to date regularly by > > checking the official IANA registry for them). At least that way you > > don't have to change a regex every time ICANN decides to add another > > TLD. (The regex is wrong anyway, I think: it doesn't have .mobi, > > which has been announced although isn't taking registrations yet, and > > it doesn't appear to have arpa, either.) > > Andrew, thank you. > > I understand now that I do'nt want to validate TLDs at all. > > I have an existing database of e-mail addresses. Those addesses are copied > from letters so they contain < > chars, points, commas etc. stupid > characters. > Sometimes two email addresses are copied to this field (contains two @ > sings, spaces or commas). Sometimes web addresses starting with www. and > without @ are present in email column. > I want simply to allow user to view those addresses and make manual > corrections before starting large mailing session in night. > > How to write a WHERE clause which selects e-mail addresses which > are surely wrong ? ... WHERE email !~ '...insert previously mentioned regex here...'; Cheers, Steve
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote: > How to write a WHERE clause which selects e-mail addresses which > are surely wrong ? Then I think the validating function someone else sent here (<http://www.databasejournal.com/img/email_val.sql>) is a good start. You probably want the opposite behaviour -- emailinvalidate(), I guess -- but that seems like a good "obviously wrong" tester. It might not be fast, though -- that loop at the special character check looks pretty painful. A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
Interesting article: http://coveryourasp.com/ValidateEmail.asp See also: http://search.cpan.org/~cwest/Email-Address-1.80/lib/Email/Address.pm http://www.faqs.org/rfcs/rfc2822.html http://docs.python.org/lib/module-rfc822.html > -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Steve Atkins > Sent: Thursday, October 20, 2005 12:35 PM > To: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Select all invalid e-mail addresses > > On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote: > > On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote: > > > >From this thread I got the regular expression > > > > [snipped] > > > > Note that that regular expression, which appears to be validating > > TLDs as well, is incredibly fragile. John Klensin has actually > > written an RFC about this very problem. Among other problems, what > > do you do when a country code ceases to be? (There's a similar > > problem that the naming bodies struggke with from time to time.) > > > > I suggest that if you want to validate TLDs, you pull them off when > > you write the data in your database, and use a lookup table to make > > sure they're valid (you can keep the table up to date regularly by > > checking the official IANA registry for them). At least that way you > > don't have to change a regex every time ICANN decides to add another > > TLD. > > You need to maintain the data, certainly. To argue that it must > be in a table to be maintained is, well, wrong. My preference would > be to keep it in a table and regenerate the regex periodically, and > in the application layer I do exactly that, but to try and do that > in a check constraint would be painful. A cleaner approach would > be to have a regex that checks for general syntax and extracts the > TLD, which is then compared to a lookup table, perhaps, but that > adds a lot of complexity for no real benefit. > > > (The regex is wrong anyway, I think: it doesn't have .mobi, > > which has been announced although isn't taking registrations yet, and > > it doesn't appear to have arpa, either.) > > While there are valid deliverable email addresses in .arpa, you really > don't want to be accepting them from end users... > > Cheers, > Steve > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org
On Thu, Oct 20, 2005 at 12:34:39PM -0700, Steve Atkins wrote: > > While there are valid deliverable email addresses in .arpa, you really > don't want to be accepting them from end users... You know, as someone who has been bitten hundreds of times by the decision of some application designer who thought s/he knew better than I what my email address could possibly be, I respectfully submit that you're mistaken. We call it a bug when other databases accept dates like '0000-00-00'; but we'd just as surely call it a bug if PostgreSQL refused to accept valid leap year dates or leap seconds. It's one thing to say you should not accept known-bad data; it's quite another to refuse data that is improbable but nevertheless perfectly good. A -- Andrew Sullivan | ajs@crankycanuck.ca I remember when computers were frustrating because they *did* exactly what you told them to. That actually seems sort of quaint now. --J.D. Baldwin
>> How to write a WHERE clause which selects e-mail addresses which >> are surely wrong ? > > ... WHERE email !~ '...insert previously mentioned regex here...'; Steve, thank you. I tried SELECT email FROM customer WHERE email !~ '/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/' but got an error ERROR: invalid regular expression: invalid character range I'm using "PostgreSQL 8.0.2 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mingw-special)" Andrus.
On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote: > I tried > > SELECT email FROM customer > WHERE email !~ > '/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/' > > but got an error > > ERROR: invalid regular expression: invalid character range Aside from the fact that this regular expression is semantically wrong, it has a few other problems: * A hyphen (-) must come first or last in a character class if you want it interpreted literally instead of as part of a range specification. test=> SELECT 'abc' ~ '[a-z0-9-_]'; -- WRONG ERROR: invalid regular expression: invalid character range test=> SELECT 'abc' ~ '[a-z0-9_-]'; ?column? ---------- t (1 row) * Regular expressions in PostgreSQL don't use delimiters like / at the beginning and end of the expression. test=> SELECT 'abc' ~ '/abc/'; -- WRONG ?column? ---------- f (1 row) test=> SELECT 'abc' ~ 'abc'; ?column? ---------- t (1 row) * If you use single quotes around the regular expression then you need to escape backslashes that should be part of the regular expression; otherwise the backslash will be parsed by the string parser before the string is used as a regular expression and you'll get unexpected results. In other words, there's an extra layer of string parsing that you have to allow for. In 8.0 and later you can avoid this by using dollar quotes. test=> SELECT 'abc' ~ 'a\.c'; -- WRONG ?column? ---------- t (1 row) test=> SELECT 'abc' ~ 'a\\.c'; ?column? ---------- f (1 row) test=> SELECT 'a.c' ~ 'a\\.c'; ?column? ---------- t (1 row) test=> SELECT 'abc' ~ $$a\.c$$; ?column? ---------- f (1 row) test=> SELECT 'a.c' ~ $$a\.c$$; ?column? ---------- t (1 row) -- Michael Fuhr
On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote: > >> How to write a WHERE clause which selects e-mail addresses which > >> are surely wrong ? > > > > ... WHERE email !~ '...insert previously mentioned regex here...'; > > Steve, > > thank you. > > I tried [snip] SELECT email FROM customer WHERE email !~* '^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$' ...should be closer. Fixes one typo in the range, uses valid pg format regex, rather than perl regex and had a couple of pedant-fixes in the TLDs supported. It's syntactically correct, and appears to do the right thing on my production DB here (which conincedentally has a customer table with an email field :)), but you should make sure you understand what the regex actually does. Cheers, Steve
On Fri, Oct 21, 2005 at 11:49:54AM -0700, Steve Atkins wrote: > SELECT email FROM customer > WHERE email !~* > '^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$' > > ...should be closer. Fixes one typo in the range, uses valid pg format regex, rather > than perl regex and had a couple of pedant-fixes in the TLDs supported. > > It's syntactically correct, and appears to do the right thing on my production > DB here (which conincedentally has a customer table with an email field :)) The backslashes should be escaped or the regular expression should be quoted with dollar quotes (8.0 and later) -- otherwise the string parser converts "\." to ".", which matches anything. For example, the above regular expression considers the following address valid: foo@example?com Even with that correction the regular expression is still wrong, especially the ^[^@]*@ part at the beginning. See this group's archives and numerous other sources for further discussion on this topic. -- Michael Fuhr
> SELECT email FROM customer > WHERE email !~* > '^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$' > > ...should be closer. Fixes one typo in the range, uses valid pg format > regex, rather > than perl regex and had a couple of pedant-fixes in the TLDs supported. > > It's syntactically correct, and appears to do the right thing on my > production > DB here (which conincedentally has a customer table with an email field > :)), but > you should make sure you understand what the regex actually does. Steve, thank you again. I applied Michael hint about dollar quoting to this and tried create temp table customer ( email char(60)); insert into customer values( 'steve@blighty.com'); SELECT email FROM customer WHERE email !~* $$ ^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$ $$ but this classifies e-mail address steve@blighty.com as invalid (select returns it). The same result are without dollar quoting, using your original select. Andrus.
On Mon, Oct 24, 2005 at 09:02:26PM +0300, Andrus wrote: > I applied Michael hint about dollar quoting to this and tried > > create temp table customer ( email char(60)); > insert into customer values( 'steve@blighty.com'); > SELECT email FROM customer WHERE email !~* > $$ > ^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$ > $$ > > but this classifies e-mail address steve@blighty.com as invalid (select > returns it). The same result are without dollar quoting, using your original > select. There are at least two problems: 1. Since you're storing the email address as char(60), in some cases it'll be padded with spaces up to 60 characters. This appears to be one of those cases: SELECT 'foo'::char(60) ~ '^foo$'; ?column? ---------- f (1 row) test=> SELECT 'foo'::char(60) ~ '^foo {57}$'; ?column? ---------- t (1 row) 2. Everything in the quoted string is part of the regular expression, including the embedded newlines immediately after the open quote and before the close quote. test=> SELECT 'foo'::text ~ $$ test$> ^foo$ test$> $$; ?column? ---------- f (1 row) test=> SELECT 'foo'::text ~ $_$^foo$$_$; ?column? ---------- t (1 row) Note the need to quote with something other than $$ ($_$ in this case) because of the $ that's part of the regular expression. Otherwise you'd get this: test=> SELECT 'foo'::text ~ $$^foo$$$; ERROR: syntax error at or near "$" at character 30 LINE 1: SELECT 'foo'::text ~ $$^foo$$$; ^ Suggestions: use text or varchar for the email address, don't embed newlines in the regular expression, and if you use dollar quotes and the regular expression ends with a dollar sign then quote with a character sequence other than $$. -- Michael Fuhr
> Suggestions: use text or varchar for the email address, don't embed > newlines in the regular expression, and if you use dollar quotes > and the regular expression ends with a dollar sign then quote with > a character sequence other than $$. Michael, thank you. Excellent! I'm afraid that using VARCHAR instead on CHAR may cause some parts of my application to stop working. So I used TRIM function. This regex allows email addresses containing two dots without any letters, like eeta..soft@online.ee I havent seen any email of such kind. Andrus. create temp table customer ( email char(60)) on commit drop; insert into customer values( 'eeta..soft@online.ee'); SELECT email FROM customer WHERE email!='' and email IS NOT NULL and TRIM(email) !~* $_$^[^@]*@(?:[^@]*\.)?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$$_$
On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote: > This regex allows email addresses containing two dots without any letters, > like eeta..soft@online.ee > I havent seen any email of such kind. That's because the regular expression is wrong: it simply checks the local part for zero or more non-@ characters instead of checking against the RFC822/RFC2822 specification. Use a search engine to find a more complete regular expression (beware: it's long). -- Michael Fuhr
On Tue, Oct 25, 2005 at 09:09:44AM -0600, Michael Fuhr wrote: > On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote: > > This regex allows email addresses containing two dots without any letters, > > like eeta..soft@online.ee > > I havent seen any email of such kind. > > That's because the regular expression is wrong: it simply checks > the local part for zero or more non-@ characters instead of checking > against the RFC822/RFC2822 specification. Use a search engine to > find a more complete regular expression (beware: it's long). eeta..soft@online.ee is a perfectly functional email address, despite not being in dot-atom form, so technically in violation of RFC 2822. There are few constraints on the local part of an email address, and those constraints are often violated in practice, and cause no problems. I do data analysis on email addresses all day, every day. I'm fully aware of RFC 2822 constraints, and I'm also aware that the correlation between them and the real world is high, but not absolute. If you were using this to validate email software that would be a different thing, but if you're actually working in the real world with real world data and are actually concerned about finding email addresses that are likely to be incorrect (rather than punishing users with noc RFC 2822 compliant email addresses) then looking at the local-part in much detail is really not useful. Cheers, Steve
>> This regex allows email addresses containing two dots without any >> letters, >> like eeta..soft@online.ee > > That's because the regular expression is wrong: it simply checks > the local part for zero or more non-@ characters instead of checking > against the RFC822/RFC2822 specification. Use a search engine to > find a more complete regular expression (beware: it's long). Michael, thank you. I found correct regexp from http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2 but this needs to be converted to Postgres. It causes the famuous ERROR: invalid regular expression: invalid character range. Since text editor find/replace cannot be used to convert it it is probably not reasonable to waste time trying to make the following code to work in Postgres. Andrus. select email from customer where email!='' and trim(email) !~* $_$[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\ xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015 "]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\ xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80 -\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]* )*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\ \\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\ x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8 0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n \015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x 80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^ \x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040 \t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([ ^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\ \\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\ x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80- \xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015() ]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\ x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04 0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\ n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\ 015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?! [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\ ]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\ x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01 5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:". \\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff] )|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^ ()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0 15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][ ^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\ n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\ x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(? :(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80- \xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]* (?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015 ()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015() ]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0 40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\ [^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\ xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]* )*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80 -\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x 80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t ]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\ \[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff]) *\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x 80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80 -\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015( )]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\ \x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t ]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0 15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015 ()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^( \040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]| \\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80 -\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015() ]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x 80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^ \x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040 \t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:". \\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff ])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\ \x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x 80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015 ()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\ \\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^ (\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000- \037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\ n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]| \([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\)) [^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff \n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*( ?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\ 000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\ xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*) *\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80- \xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*) *(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\ ]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\] )[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80- \xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*( ?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80 -\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)< >@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8 0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?: \([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()] *(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*) *\)[\040\t]*)*)*>)$_$
On Tue, 2005-10-25 at 13:01, Andrus wrote: > >> This regex allows email addresses containing two dots without any > >> letters, > >> like eeta..soft@online.ee > > > > That's because the regular expression is wrong: it simply checks > > the local part for zero or more non-@ characters instead of checking > > against the RFC822/RFC2822 specification. Use a search engine to > > find a more complete regular expression (beware: it's long). > > Michael, thank you. > I found correct regexp from > http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2 > > but this needs to be converted to Postgres. It causes the famuous ERROR: > invalid regular expression: invalid character range. > Since text editor find/replace cannot be used to convert it it is probably > not reasonable to waste time trying to make the following code to work in > Postgres. > > Andrus. PERL REGEX SNIPPED. That's because it's a perl regex, not a posix or sql regex. IF you wrapped it in a plperl function, then you could use it. Anyone know if the PCRE library can handle this thing? I guess I could try it myself.