Re: bytea encode performance issues

Поиск
Список
Период
Сортировка
От Steve Atkins
Тема Re: bytea encode performance issues
Дата
Msg-id FE53470D-24B2-4994-A72C-EEEB832CE22E@blighty.com
обсуждение исходный текст
Ответ на Re: bytea encode performance issues  (Klint Gore <kgore4@une.edu.au>)
Ответы Re: bytea encode performance issues  ("Daniel Verite" <daniel@manitou-mail.org>)
Список pgsql-general
On Aug 7, 2008, at 5:28 PM, Klint Gore wrote:

> Alvaro Herrera wrote:
>> Merlin Moncure escribió:
>>
>> > er, I see the problem (single piece of text with multiple encodings
>> > inside) :-).  ok, it's more complicated than I thought.  still, you
>> > need to convert the email to utf8.  There simply must be a way,
>> > otherwise your emails are not well defined.  This is a client side
>> > problem...if you push it to the server in ascii, you can't use any
>> > server side text operations reliably.
>>
>> I think the solution is to get the encoding from the email header and
>> the set the client_encoding to that.  However, as soon as an email
>> with
>> an unsopported encoding comes by, you are stuck.
>>
> Why not leave it as bytea?  The postgres server has no encoding
> problems with storing whatever you want to throw at it, postgres
> client has no problem reading it back.  It's then up to the imap/
> pop3/whatever client to deal with it.  That's normally the way the
> email server world works.
>
> FWIW the RFC for email (822/2822) says it is all ASCII so it's not a
> problem at all as long as every email generator follows the IETF
> rules (body here is not just the text of the message - its the data
> after the blank line in the SMTP conversation until the CRLF.CRLF).

That's not actually true for email, though. Content-Transfer-Encoding:
8bit, combined with ESMTP 8BITMIME, for example.

So, yeah, you're right. Generally, email is too complex to deal with
in the database as anything other than an opaque bytea blob, along
with some metadata (that metadata might well include text fields that
contain the text content of the mail, for search, for instance).
Another option is to convert non-ascii mail to ascii before storing
it, but that's not as trivial as it sounds.

You cannot convert mail, in general, to utf8, as not all mail content
is textual.

> The 2 things that will make a difference to the query is 1. get rid
> of the encode call and 2. stop it being toasted
>
> Assuming that the dbmail code can't be changed yet
> 1. make encode a no-op.
> -   create schema foo;
> -   create function foo.encode (bytea,text) returns bytea as $$
> select $1 $$ language sql immutable;
> -   change postgresql.conf search_path to foo,pg_catalog,....
> This completly breaks encode so if anything uses it properly then
> it's broken that.  From the query we've seen, we don't know if it's
> needed or not.  What query do you get if you search for something
> that has utf or other encoding non-ASCII characters?  If it looks
> like the output of escape (i.e. client used PQescapeByteaConn on the
> search text), then the escape might be required.
>
> 2. dbmail already chunks email up into ~500k blocks.  If that is a
> configurable setting, turn it down to about 1.5k blocks.

Cheers,
   Steve


В списке pgsql-general по дате отправления:

Предыдущее
От: Klint Gore
Дата:
Сообщение: Re: bytea encode performance issues
Следующее
От: Lennin Caro
Дата:
Сообщение: Re: Create Table Dinamic