Обсуждение: BUG #5021: ts_parse doesn't recognize email addresses with underscores
The following bug has been logged online: Bug reference: 5021 Logged by: Dan O'Hara Email address: danarasoftware@gmail.com PostgreSQL version: 8.3.7 Operating system: win32 Description: ts_parse doesn't recognize email addresses with underscores Details: In the following example, select distinct token as email from ts_parse('default', ' first_last@yahoo.com ' ) where tokid = 4 ts_parse returns last@yahoo.com rather than first_last@yahoo.com It seems that any text prior to the underscore is truncated. If the portion following the underscore is only numeric, such as this example, select distinct token as email from ts_parse('default', ' bill_2000@yahoo.com ' ) where tokid = 4 then ts_parse returns nothing at all. section 3.2.3 of RFC 5322 indicates that underscores are valid characters in an email address. http://tools.ietf.org/html/rfc5322
On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrot= e: > > The following bug has been logged online: > > Bug reference: =A0 =A0 =A05021 > Logged by: =A0 =A0 =A0 =A0 =A0Dan O'Hara > Email address: =A0 =A0 =A0danarasoftware@gmail.com > PostgreSQL version: 8.3.7 > Operating system: =A0 win32 > Description: =A0 =A0 =A0 =A0ts_parse doesn't recognize email addresses wi= th > underscores > Details: > > In the following example, > > select distinct token as email > from ts_parse('default', ' first_last@yahoo.com ' =A0 ) > where tokid =3D 4 > > ts_parse returns last@yahoo.com rather than first_last@yahoo.com =A0It se= ems > that any text prior to the underscore is truncated. =A0If the portion > following the underscore is only numeric, such as this example, > > select distinct token as email > from ts_parse('default', ' bill_2000@yahoo.com ' =A0 ) > where tokid =3D 4 > > then ts_parse returns nothing at all. > > section 3.2.3 of RFC 5322 indicates that underscores are valid characters= in > an email address. > > http://tools.ietf.org/html/rfc5322 I don't think this has much to do with email addresses. If you do: select token from ts_parse('a_b'); ...you get three tokens. In your case you're pulling out the fourth token, but some of your examples don't have four tokens, so then you get nothing at all. I'm not real familiar with ts_parse(), but I'm thinking that it doesn't have any special casing for email addresses and is just intended to parse text for full-text-search - in which case splitting on _ is a pretty good algorithm. ...Robert
Thanks for having a look at this bug. According to section 12.8.2 of the postgres manual, ts_parse is supposed to recognize different types of data, one of which (#4) is an email address. The list of recognized data formats for parse can be selected via this quer= y: SELECT * FROM ts_token_type('default'); The example in the bug I reported is valid email address, according to the RFC, but isn't recognized as such by the full text search in postgres. This bug will have a real impact on anybody using ts functions to locate email addresses, as only some of them are found in the query. Regards Dan On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wr= ote: >> >> The following bug has been logged online: >> >> Bug reference: =A0 =A0 =A05021 >> Logged by: =A0 =A0 =A0 =A0 =A0Dan O'Hara >> Email address: =A0 =A0 =A0danarasoftware@gmail.com >> PostgreSQL version: 8.3.7 >> Operating system: =A0 win32 >> Description: =A0 =A0 =A0 =A0ts_parse doesn't recognize email addresses w= ith >> underscores >> Details: >> >> In the following example, >> >> select distinct token as email >> from ts_parse('default', ' first_last@yahoo.com ' =A0 ) >> where tokid =3D 4 >> >> ts_parse returns last@yahoo.com rather than first_last@yahoo.com =A0It s= eems >> that any text prior to the underscore is truncated. =A0If the portion >> following the underscore is only numeric, such as this example, >> >> select distinct token as email >> from ts_parse('default', ' bill_2000@yahoo.com ' =A0 ) >> where tokid =3D 4 >> >> then ts_parse returns nothing at all. >> >> section 3.2.3 of RFC 5322 indicates that underscores are valid character= s in >> an email address. >> >> http://tools.ietf.org/html/rfc5322 > > I don't think this has much to do with email addresses. =A0If you do: > > select token from ts_parse('a_b'); > > ...you get three tokens. =A0In your case you're pulling out the fourth > token, but some of your examples don't have four tokens, so then you > get nothing at all. > > I'm not real familiar with ts_parse(), but I'm thinking that it > doesn't have any special casing for email addresses and is just > intended to parse text for full-text-search - in which case splitting > on _ is a pretty good algorithm. > > ...Robert > --=20 ------------------------------------------------------------------- Dan O'Hara Danara Software Systems, Inc. danarasoftware@gmail.com 613 288-8733
Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
От
Euler Taveira de Oliveira
Дата:
Robert Haas escreveu: > I'm not real familiar with ts_parse(), but I'm thinking that it > doesn't have any special casing for email addresses and is just > intended to parse text for full-text-search - in which case splitting > on _ is a pretty good algorithm. > It is a bug. The tsearch claims to identify types of tokens but it doesn't correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail but the function fails to report that. It is not that simple to identify an e-mail address that agrees with RFC. As that code is a state machine, IMHO it decides too early (when it finds _) that that string is not an e-mail address. AFAIR, that's not an one-line fix. euler=# select distinct token as email from ts_parse('default', 'foo.bar@baz.com'); email âââââââââââââââââ foo.bar@baz.com (1 row) euler=# select distinct token as email from ts_parse('default', 'foo+bar@baz.com'); email âââââââââââââ foo + bar@baz.com (3 rows) euler=# select distinct token as email from ts_parse('default', 'foo_bar@baz.com'); email âââââââââââââ foo bar@baz.com _ (3 rows) -- Euler Taveira de Oliveira http://www.timbira.com/
I agree that it isn't easy to determine if given text is a valid email address. As I couldn't use ts_parse, I ended up using a regex, which worked substantially better at pulling out the emails from the text stream. I haven't looked at the code, but perhaps it is possible to do the same thing here? Even a regex that is 99% correct would be better than the current tokenizer which is only right about 80-85% of the time. My workaround looked something like this: select regexp_matches(resumetext,E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4= }','gi') as email from "Resume" cheers Dan On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira <euler@timbira.com> wrote: > Robert Haas escreveu: >> I'm not real familiar with ts_parse(), but I'm thinking that it >> doesn't have any special casing for email addresses and is just >> intended to parse text for full-text-search - in which case splitting >> on _ is a pretty good algorithm. >> > It is a bug. The tsearch claims to identify types of tokens but it doesn't > correctly identify any valid e-mail addresses. As Dan stated ts_parse() f= ails > to recognize an e-mail address. For example, foo+bar@baz.com is a valid e= -mail > but the function fails to report that. > > It is not that simple to identify an e-mail address that agrees with RFC.= As > that code is a state machine, IMHO it decides too early (when it finds _)= that > that string is not an e-mail address. AFAIR, that's not an one-line fix. > > euler=3D# select distinct token as email from ts_parse('default', > 'foo.bar@baz.com'); > =C2=A0 =C2=A0 =C2=A0email > =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80 > =C2=A0foo.bar@baz.com > (1 row) > > euler=3D# select distinct token as email from ts_parse('default', > 'foo+bar@baz.com'); > =C2=A0 =C2=A0email > =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80 > =C2=A0foo > =C2=A0+ > =C2=A0bar@baz.com > (3 rows) > > euler=3D# select distinct token as email from ts_parse('default', > 'foo_bar@baz.com'); > =C2=A0 =C2=A0email > =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80 > =C2=A0foo > =C2=A0bar@baz.com > =C2=A0_ > (3 rows) > > > -- > =C2=A0Euler Taveira de Oliveira > =C2=A0http://www.timbira.com/ > --=20 ------------------------------------------------------------------- Dan O'Hara Danara Software Systems, Inc. danarasoftware@gmail.com 613 288-8733
Euler Taveira de Oliveira escribió: > Robert Haas escreveu: > > I'm not real familiar with ts_parse(), but I'm thinking that it > > doesn't have any special casing for email addresses and is just > > intended to parse text for full-text-search - in which case splitting > > on _ is a pretty good algorithm. > > > It is a bug. The tsearch claims to identify types of tokens but it doesn't > correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails > to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail > but the function fails to report that. It is similarly too-simplistic for other cases too, like file names (particularly where Windows filenames are concerned). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support