Tom Lane wrote:
> I wrote:
>
>> Actually, I have to take back that objection: on closer look, COPY
>> validates the data only once and does so before applying its own
>> backslash-escaping rules. So there is a risk in that path too.
>>
>
>
>> It's still pretty annoying to be validating the data twice in the
>> common case where no backslash reduction occurred, but I'm not sure
>> I see any good way to avoid it.
>>
>
> Further thought here: if we put encoding verification into textin()
> and related functions, could we *remove* it from COPY IN, in the common
> case where client and server encodings are the same? Currently, copy.c
> forces a trip through pg_client_to_server for multibyte encodings
> even when the encodings are the same, so as to perform validation.
> But I'm wondering whether we'd still need that. There's no risk of
> SQL injection in COPY data. Bogus input encoding could possibly
> make for confusion about where the field boundaries are, but bad
> data is bad data in any case.
>
> regards, tom lane
>
>
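To make the risk described in the quoted text concrete, here is a minimal sketch of why validating before backslash reduction is not enough. It is not the copy.c logic: unescape_octal() is a hypothetical, simplified stand-in that handles only \ooo octal escapes (it ignores escaped backslashes and every other escape), and is_valid_utf8() is an illustrative helper.

```python
import re


def unescape_octal(raw: bytes) -> bytes:
    """Hypothetical, simplified escape reduction: \\ooo (1-3 octal digits) only."""
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        raw,
    )


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw field is pure ASCII, so it passes validation as received.
field = rb"abc\377def"
passes_before = is_valid_utf8(field)

# After escape reduction the field contains the byte 0xFF,
# which can never appear in well-formed UTF-8.
reduced = unescape_octal(field)
passes_after = is_valid_utf8(reduced)

print(passes_before, passes_after)
```

A check placed after the reduction step (for example in textin()) would see the reduced bytes and could reject them.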
Here are some timing tests on 1 million rows of random UTF-8 encoded
data, 100 characters per row. The two sets of timings overlap, so it
doesn't look to me like the saving you're suggesting is worth the
trouble.

baseline:

  Time: 28228.325 ms
  Time: 25987.740 ms
  Time: 25950.707 ms
  Time: 25756.371 ms
  Time: 27589.719 ms
  Time: 25774.417 ms

after adding suggested extra test to textin():

  Time: 26722.376 ms
  Time: 28343.226 ms
  Time: 26529.364 ms
  Time: 28020.140 ms
  Time: 24836.853 ms
  Time: 24860.530 ms

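For anyone who wants to compare the two sets of timings numerically, a small sketch along these lines summarizes them. The figures are copied from the runs above; the variable names and the summarize() helper are only illustrative.

```python
from statistics import mean, stdev

# Timings in milliseconds, copied from the two runs reported above.
baseline = [28228.325, 25987.740, 25950.707, 25756.371, 27589.719, 25774.417]
with_check = [26722.376, 28343.226, 26529.364, 28020.140, 24836.853, 24860.530]


def summarize(label, samples):
    """Print and return the mean and sample standard deviation of a run."""
    m, s = mean(samples), stdev(samples)
    print(f"{label}: mean {m:.1f} ms, stdev {s:.1f} ms")
    return m, s


base_mean, base_sd = summarize("baseline", baseline)
check_mean, check_sd = summarize("with textin() check", with_check)
print(f"difference in means: {check_mean - base_mean:+.1f} ms")
```

On these figures the means differ by roughly 4 ms, far less than the run-to-run spread within either set.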
Script is:
\timing
create table xyz (x text);
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
drop table xyz;
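The input file itself is not shown. A rough sketch of how comparable data could be produced follows; this is an assumption, not the generator actually used. ALPHABET, random_row() and write_test_file() are hypothetical names, and the alphabet deliberately leaves out backslash, tab and newline so that COPY's text format reads each line as one plain field.

```python
import random

# Hypothetical alphabet: ASCII letters plus some multibyte characters.
# Backslash, tab and newline are excluded on purpose.
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + "éüñøßλжя中文日本"


def random_row(rng, length=100):
    """Return one row of `length` randomly chosen characters."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def write_test_file(path, rows=1_000_000, length=100, seed=0):
    """Write `rows` lines of random UTF-8 text, one value per line."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for _ in range(rows):
            f.write(random_row(rng, length) + "\n")


# Example call (not executed here):
#   write_test_file("/tmp/utf8.data")
```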
Test platform: Fedora Core 6 on an Athlon 64.
cheers
andrew