Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From	Dan O'Hara
Subject	Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
Date	22 October 2009, 20:23:42
Msg-id	557802370910221254k624306eg81ae6176eb3bd9d4@mail.gmail.com
In reply to	Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores (Euler Taveira de Oliveira <euler@timbira.com>)
List	pgsql-bugs


I agree that it isn't easy to determine whether a given piece of text is
a valid email address.  As I couldn't use ts_parse, I ended up using a
regex, which worked substantially better at pulling the emails out of the
text stream.  I haven't looked at the code, but perhaps it is possible to
do the same thing here?  Even a regex that is 99% correct would be
better than the current tokenizer, which is only right about 80-85% of
the time.

My workaround looked something like this:

  select regexp_matches(resumetext, E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}', 'gi')
         as email
    from "Resume";
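
For anyone who wants to try the pattern outside the database, here is a minimal sketch of the same regex in Python. The function name `extract_emails` and the sample string are made up for illustration; `re.IGNORECASE` stands in for the 'i' flag and `findall()` for the 'g' flag of regexp_matches. It is only an approximation of what counts as an email address, not an RFC-compliant check.

```python
import re

# Same character classes as the SQL workaround. IGNORECASE plays the role of
# the 'i' flag; findall() returns every match, like the 'g' flag.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)


def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


# Hypothetical sample built from the three addresses discussed in this thread.
sample = "Contact foo_bar@baz.com or foo+bar@baz.com; see also foo.bar@baz.com."
print(extract_emails(sample))
```

Note that the {2,4} bound clips longer top-level domains (an address ending in .museum would be cut off after .muse), so the upper limit may need raising depending on the data.
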
cheers
Dan

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
> Robert Haas wrote:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. tsearch claims to identify token types, but it doesn't
> correctly identify all valid e-mail addresses. As Dan stated, ts_parse()
> fails to recognize an e-mail address. For example, foo+bar@baz.com is a
> valid e-mail address, but the function fails to report it as one.
>
> It is not that simple to identify an e-mail address that agrees with the
> RFC. As that code is a state machine, IMHO it decides too early (when it
> finds _) that the string is not an e-mail address. AFAIR, that's not a
> one-line fix.
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo.bar@baz.com');
>       email
> ─────────────────
>  foo.bar@baz.com
> (1 row)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo+bar@baz.com');
>     email
> ─────────────
>  foo
>  +
>  bar@baz.com
> (3 rows)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo_bar@baz.com');
>     email
> ─────────────
>  foo
>  bar@baz.com
>  _
> (3 rows)
>
>
> --
>  Euler Taveira de Oliveira
>  http://www.timbira.com/
>



-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
