Re: [GENERAL] Text search parser's treatment of URLs and emails

From Christian Ullrich
Subject Re: [GENERAL] Text search parser's treatment of URLs and emails
Date
Msg-id i95ive$41m$1@dough.gmane.org
In reply to Re: [GENERAL] Text search parser's treatment of URLs and emails  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
* Bruce Momjian wrote:

> Thom Brown wrote:

>> Also:
>>
>> SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');
>>
>> Yields:
>>
>>     alias   |   description   |       token
>> -----------+-----------------+--------------------
>>   asciiword | Word, all ASCII | myname
>>   blank     | Space symbols   | +
>>   email     | Email address   | priority@gmail.com
>> (3 rows)
>>
>> The entire string I entered is a valid email address, and isn't
>> totally uncommon.  Shouldn't such email address styles be
>> taken into account?  The example above incorrectly identifies the
>> email address since the real destination address would most likely be
>> myname@gmail.com.
>
> I had no idea '+' could be part of an email address, and in fact it is a
> modifier that is stripped off when delivering the email:

No, it's not. Strictly speaking, "+" is simply one of many characters 
that are valid in the local-part of an e-mail address according to RFC 
2822 (and 822, which was even more lenient there). The plus sign does 
not have any intrinsic semantics, except that it is obviously different 
from any other character for purposes of comparing addresses.

Even among applications that make decisions based on the value of 
various parts of e-mail addresses (usually MTAs when forwarding 
messages), the only ones that should be assigning special meaning to the 
plus sign are the MTAs responsible for delivering messages to their 
recipients in the recipient domain. A database that is only used for 
storing such addresses definitely should not attempt to divine what the 
_sender_ of the message meant when he put that plus sign in, or what it 
might mean to the _recipient_, who has no control over what people use 
as addresses when they send him e-mail.

Plainly put, the local-part should be treated as opaque everywhere 
outside the "administrative scope" of the recipient, and if you don't 
know whether you are in that scope, you are not. Splitting the 
local-part into subparts based on arbitrary rules that have no actual 
knowledge of the policies in place at the organization that assigned the 
address can only be a mistake.
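
As a rough illustration of what "opaque" means here, the following is a
minimal Python sketch, not anything taken from the parser. The function
names split_address and same_mailbox are hypothetical, and the sketch
assumes an already validated address without a quoted local-part. It
splits only at the last "@", compares the local-part exactly, and folds
case only in the domain.

```python
def split_address(address: str) -> tuple:
    """Split at the last '@'; the local-part is returned untouched."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an address: {address!r}")
    return local, domain


def same_mailbox(a: str, b: str) -> bool:
    """Compare two addresses without interpreting the local-part."""
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    # Domains are case-insensitive; the local-part is compared exactly,
    # so "jane+john" and "jane" are simply two different local-parts.
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

With this, myname+priority@gmail.com and myname@gmail.com are treated as
different addresses, because nothing outside gmail.com's administrative
scope can know otherwise.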

Of course, the application that is using the database is free to use a 
ts configuration that does assign such meaning, if it has a reason to do 
so.
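
An application-level analogue of such an opt-in might look like the
sketch below. It is not a text search configuration, only an
illustration in Python; the policy table SUBADDRESS_DELIMITERS and the
function base_and_detail are hypothetical names, and the entry for
example.com is an assumed policy. The local-part is split only for
domains whose rules the application actually knows and stays opaque for
every other domain.

```python
# Hypothetical policy table: domains whose subaddressing rules this
# application actually knows, mapped to the delimiter they use.
SUBADDRESS_DELIMITERS = {"example.com": "+"}


def base_and_detail(address: str, policy=SUBADDRESS_DELIMITERS):
    """Return (base, detail, domain); detail is None outside known policy."""
    local, _, domain = address.rpartition("@")
    delimiter = policy.get(domain.lower())
    if delimiter is None or delimiter not in local:
        # Unknown domain or no delimiter: leave the local-part opaque.
        return local, None, domain
    base, _, detail = local.partition(delimiter)
    return base, detail, domain
```

Under the assumed policy, jane+john@example.com splits into "jane" and
"john", while chris+postgresql@chrullrich.net is returned with its
local-part intact.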

Examples:

- chris+postgresql@chrullrich.net
  Looks like I have a dedicated folder for messages concerning
  PostgreSQL. Now, _I_ know that I do not have such a folder, and
  that the suffix is meaningless. Nobody else can know for sure.

- jane+john@example.com
  What is this?
  - A special suffix that John uses when sending messages
    to Jane, so they are forwarded to her BlackBerry with high
    priority?
  - A folder for Jane's large collection of "Dear John" letters?
  - Or is it simply Jane's and John's everyday address?

(Disclosure: I am what might be called a "plus sign nut". I routinely 
complain to webmasters and such when their applications try to tell me 
that the plus sign is not allowed in e-mail addresses. If you think I 
feel too strongly about this, you are free to disregard my message.)

-- 
Christian


