Thread: Add ENCODING option to COPY

Add ENCODING option to COPY

From

Hitoshi Harada

Date:

January 14, 2011, 13:25:39

Here's the patch to add ENCODING option to COPY command.
The fundamental issue was explained months ago:

http://archives.postgresql.org/message-id/AANLkTikCt6bHXZjO_oX+JS7+G=jAQ7gVZPu0Owjcsbfb@mail.gmail.com

In short, client_encoding is not appropriate for COPY operations, so we
need a dedicated option for it. Robert Haas agreed with the need later
in that thread. Thanks.

The patch overrides client_encoding with the added ENCODING option, and
restores it as soon as the copy is done. I have seen some requests to use
pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), but the former will surely add
a slight overhead for each line read, and I believe COPY is a
performance-sensitive command.

I'll add this to the CF app.

Regards,

--
Hitoshi Harada

Attachments

copy_encoding.patch

Re: Add ENCODING option to COPY

From

Itagaki Takahiro

Date:

January 24, 2011, 22:07:53

On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> The patch overrides client_encoding by the added ENCODING option, and
> restores it as soon as copy is done.

We cannot do that because error messages should be encoded in the original
encoding even during COPY commands with encoding option. Error messages
could contain non-ASCII characters if lc_messages is set.

> I see some complaints ask to use
> pg_do_encoding_conversion() instead of
> pg_client_to_server/server_to_client(), but the former will surely add
> slight overhead per reading line

If we want to reduce the overhead, we should cache the conversion procedure
in CopyState. How about adding something like "FmgrInfo file_to_server_conv"
into it?

-- 
Itagaki Takahiro
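
The caching suggested in this message can be sketched as below. This is a toy model under stated assumptions, not code from the patch: ToyCopyState, conv_fn, lookup_conversion() and identity_conv() are hypothetical stand-ins (a real backend would hold an FmgrInfo and find the conversion procedure in the catalog). The only point is that the lookup runs once per COPY rather than once per line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter signature; not PostgreSQL's real one. */
typedef size_t (*conv_fn)(const char *src, char *dst, size_t dstlen);

/* Counts how often the (expensive) lookup runs, for demonstration only. */
static int lookup_calls = 0;

/* Stand-in converter: copies the bytes unchanged. */
static size_t
identity_conv(const char *src, char *dst, size_t dstlen)
{
    size_t n = strlen(src);

    assert(dstlen > n);         /* room for the terminating NUL */
    memcpy(dst, src, n + 1);
    return n;
}

/* Stand-in for the catalog lookup, which is the costly step. */
static conv_fn
lookup_conversion(const char *from_enc, const char *to_enc)
{
    (void) from_enc;
    (void) to_enc;
    lookup_calls++;
    return identity_conv;
}

/* Toy analogue of CopyState: the converter is resolved once and kept. */
typedef struct ToyCopyState
{
    conv_fn file_to_server_conv;
} ToyCopyState;

/* Called once when the copy starts. */
static void
toy_copy_begin(ToyCopyState *cs, const char *file_enc, const char *server_enc)
{
    cs->file_to_server_conv = lookup_conversion(file_enc, server_enc);
}

/* Called for every input line; no lookup happens here. */
static size_t
toy_copy_convert_line(ToyCopyState *cs, const char *line,
                      char *out, size_t outlen)
{
    return cs->file_to_server_conv(line, out, outlen);
}
```

With this shape, converting a thousand lines after one toy_copy_begin() call leaves lookup_calls at 1.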

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

January 25, 2011, 11:24:36

2011/1/25 Itagaki Takahiro <itagaki.takahiro@gmail.com>:
> On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> The patch overrides client_encoding by the added ENCODING option, and
>> restores it as soon as copy is done.
>
> We cannot do that because error messages should be encoded in the original
> encoding even during COPY commands with encoding option. Error messages
> could contain non-ASCII characters if lc_messages is set.

Agreed.

>> I see some complaints ask to use
>> pg_do_encoding_conversion() instead of
>> pg_client_to_server/server_to_client(), but the former will surely add
>> slight overhead per reading line
>
> If we want to reduce the overhead, we should cache the conversion procedure
> in CopyState. How about adding something like "FmgrInfo file_to_server_conv"
> into it?

I looked into the code and found that we cannot pass FmgrInfo * to
any functions defined in pg_wchar.h, since that header file is shared
with libpq, too.

For the record, I also tried pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), and a simple benchmark shows
it is too slow.

COPY FROM with 3,000,000 lines of 3 columns each (~22 MB TSV):

* utf8 -> utf8 (no conversion)
13428.233 ms
13322.832 ms
15661.093 ms

* euc_jp -> utf8 (client_encoding)
17527.470 ms
16457.452 ms
16522.337 ms

* euc_jp -> utf8 (pg_do_encoding_conversion)
20550.983 ms
21425.313 ms
20774.323 ms

I'll check the code more if we have better alternatives.

Regards,


-- 
Hitoshi Harada
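
To put the timings above side by side, the short sketch below averages the three runs of each case. The numbers are taken verbatim from the message; the names (mean_ms, print_summary and the arrays) are made up for this illustration. The means come out at roughly 14137 ms, 16836 ms and 20917 ms, so in this benchmark the pg_do_encoding_conversion() path is about 24% slower than the client_encoding path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define N_RUNS 3

/* Timings in milliseconds, copied from the benchmark in the message. */
static const double no_conversion[N_RUNS] = {13428.233, 13322.832, 15661.093};
static const double via_client_encoding[N_RUNS] = {17527.470, 16457.452, 16522.337};
static const double via_do_conversion[N_RUNS] = {20550.983, 21425.313, 20774.323};

/* Arithmetic mean of n timings. */
static double
mean_ms(const double *runs, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double) n;
}

/* Print each mean and its ratio to the no-conversion baseline. */
static void
print_summary(void)
{
    double base = mean_ms(no_conversion, N_RUNS);
    double ce = mean_ms(via_client_encoding, N_RUNS);
    double dc = mean_ms(via_do_conversion, N_RUNS);

    printf("no conversion:             %.3f ms\n", base);
    printf("client_encoding:           %.3f ms (x%.2f)\n", ce, ce / base);
    printf("pg_do_encoding_conversion: %.3f ms (x%.2f)\n", dc, dc / base);
}
```

Three runs per case is a small sample (the first case alone varies by more than two seconds), so the ratios are indicative only.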

Re: Add ENCODING option to COPY

From

Robert Haas

Date:

January 30, 2011, 16:21:27

On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> I'll check the code more if we have better alternatives.

Where are we with this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

January 30, 2011, 23:49:24

2011/1/31 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> I'll check the code more if we have better alternatives.
>
> Where are we with this?

I'll post another version today.

Regards,

-- 
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

January 31, 2011, 11:07:28

2011/1/31 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2011/1/31 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>>> I'll check the code more if we have better alternatives.
>>
>> Where are we with this?
>
> I'll post another version today.

Here's the patch.

Finally I concluded the concern Itagaki-san raised can be solved by
adding code that restores client_encoding in copy_in_error_callback. I
tested some encoding mismatch cases with this patch and saw
appropriate messages in an NLS environment.

Regards,

--
Hitoshi Harada

Attachments

copy_encoding.v2.patch

Re: Add ENCODING option to COPY

From

Tom Lane

Date:

January 31, 2011, 12:25:47

Hitoshi Harada <umi.tanuki@gmail.com> writes:
> Finally I concluded the concern Itagaki-san raised can be solved by
> adding code that restores client_encoding in copy_in_error_callback.

That seems like an absolutely horrid idea.  Error context callbacks
should not have side-effects like that.  They're not guaranteed to be
called at all, let alone in any particular order.  In this case I'd also
be worried that the state needs to be fixed before elog.c reaches the
point of calling the callbacks --- there's nothing to say that it might
not try to translate some strings to the client encoding earlier than
that.

It might happen to work today (or at least in the scenarios you tested),
but it seems fragile as can be.
        regards, tom lane

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

January 31, 2011, 22:17:05

2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> Finally I concluded the concern Itagaki-san raised can be solved by
>> adding code that restores client_encoding in copy_in_error_callback.
>
> It might happen to work today (or at least in the scenarios you tested),
> but it seems fragile as can be.

Although I thought its fragility was acceptable to avoid making the
patch too complex, I agree with you.
The third patch is attached. It modifies the mb routines so that they can
receive conversion procedures as FmgrInfo * and saves the function
pointer in CopyState.
I tested it with the encoding option and saw no performance slowdown.

Regards,

--
Hitoshi Harada

Attachments

copy_encoding.v3.patch

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 1, 2011, 00:08:26

2011/2/1 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>:
>> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>>> Finally I concluded the concern Itagaki-san raised can be solved by
>>> adding code that restores client_encoding in copy_in_error_callback.
>>
>> It might happen to work today (or at least in the scenarios you tested),
>> but it seems fragile as can be.
>
> Although I thought its fragile-ness was acceptable to avoid making the
> patch too complex, I agree with you.
> The third patch is attached, modifying mb routines so that they can
> receive conversion procedures as FmgrInfo * and save the function
> pointer in CopyState.
> I tested it with encoding option and could not see performance slowdown.
>

Hmm, sorry, the patch was wrong. Correct version is attached.

Regards,



--
Hitoshi Harada

Attachments

copy_encoding.v4.patch

Re: Add ENCODING option to COPY

From

Itagaki Takahiro

Date:

February 4, 2011, 01:40:23

On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> The third patch is attached, modifying mb routines so that they can
>> receive conversion procedures as FmgrInfo * and save the function
>> pointer in CopyState.
>> I tested it with encoding option and could not see performance slowdown.
> Hmm, sorry, the patch was wrong. Correct version is attached.

Here is a brief review for the patch.

* Syntax: ENCODING encoding vs. ENCODING 'encoding'
We don't have to quote encoding names in the patch, but we always need
quotes for CREATE DATABASE WITH ENCODING. I think we should modify
CREATE DATABASE to accept unquoted encoding names aside from the patch.

Changes in pg_wchar.h are the most debatable parts of the patch.
The patch adds pg_cached_encoding_conversion(). We normally use
pg_do_encoding_conversion(), but it is slower than the proposed
function because it looks up the conversion proc in the catalog on every call.

* Can we use #ifndef FRONTEND in the header?
Usage of fmgr.h members will break client applications without the #ifdef,
but I guess client apps don't always define FRONTEND.
If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
a better header for the new API because FindDefaultConversion() is in it.

Changes in copy.c look good except for a few trivial cosmetic issues:

* encoding_option could be a local variable instead of cstate's member.
* cstate->client_encoding is renamed to target_encoding, but I prefer file_encoding or remote_encoding.
* CopyState can have conv_proc entity as a member instead of the pointer.
* need_transcoding checks could be replaced with conv_proc IS NULL check.

-- 
Itagaki Takahiro

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 4, 2011, 10:55:34

2011/2/4 Itagaki Takahiro <itagaki.takahiro@gmail.com>:
> On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>>> The third patch is attached, modifying mb routines so that they can
>>> receive conversion procedures as FmgrInfo * and save the function
>>> pointer in CopyState.
>>> I tested it with encoding option and could not see performance slowdown.
>> Hmm, sorry, the patch was wrong. Correct version is attached.
>
> Here is a brief review for the patch.

Thanks for the review.

> * Syntax: ENCODING encoding vs. ENCODING 'encoding'
> We don't have to quote encoding names in the patch, but we always need
> quotes for CREATE DATABASE WITH ENCODING. I think we should modify
> CREATE DATABASE to accept unquoted encoding names aside from the patch.

I followed the syntax of SET client_encoding TO xxx, where you don't
need to quote xxx. I didn't consider CREATE DATABASE.

> Changes in pg_wchar.h are the most debatable parts of the patch.
> The patch adds pg_cached_encoding_conversion(). We normally use
> pg_do_encoding_conversion(), but it is slower than the proposed
> function because it lookups conversion proc from catalog every call.
>
> * Can we use #ifndef FRONTEND in the header?
> Usage of fmgr.h members will broke client applications without the #ifdef,
> but I guess client apps don't always have definitions of FRONTEND.
> If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
> a better header for the new API because FindDefaultConversion() is in it.

That doesn't look like a clean solution to me. FindDefaultConversion() is
implemented in pg_conversion.c, whereas
pg_cached_encoding_conversion() is in mbutils.c. Maybe add another
header, say pg_wchar_backend.h?

> * CopyState can have conv_proc entity as a member instead of the pointer.
> * need_transcoding checks could be replaced with conv_proc IS NULL check.

No, need_transcoding means you need verification even if the target
and server encodings are the same. See comments in CopyTo().

If pg_wchar_backend.h is acceptable, I'll post the revised patch.


Regards,

-- 
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Tom Lane

Date:

February 4, 2011, 11:47:15

Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:
> * Syntax: ENCODING encoding vs. ENCODING 'encoding'
> We don't have to quote encoding names in the patch, but we always need
> quotes for CREATE DATABASE WITH ENCODING. I think we should modify
> CREATE DATABASE to accept unquoted encoding names aside from the patch.

The reason that we use quotes in CREATE DATABASE is that encoding names
aren't assumed to be valid SQL identifiers.  If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

> Changes in pg_wchar.h are the most debatable parts of the patch.
> The patch adds pg_cached_encoding_conversion(). We normally use
> pg_do_encoding_conversion(), but it is slower than the proposed
> function because it lookups conversion proc from catalog every call.

Why should callers need to be changed for that?  Just make the existing
function use caching.

> * Can we use #ifndef FRONTEND in the header?
> Usage of fmgr.h members will broke client applications without the #ifdef,
> but I guess client apps don't always have definitions of FRONTEND.
> If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
> a better header for the new API because FindDefaultConversion() is in it.

Yeah, putting backend-only stuff into that header is a nonstarter.
        regards, tom lane

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 4, 2011, 12:27:08

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
> Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:
>> * Syntax: ENCODING encoding vs. ENCODING 'encoding'
>> We don't have to quote encoding names in the patch, but we always need
>> quotes for CREATE DATABASE WITH ENCODING. I think we should modify
>> CREATE DATABASE to accept unquoted encoding names aside from the patch.
>
> The reason that we use quotes in CREATE DATABASE is that encoding names
> aren't assumed to be valid SQL identifiers.  If this patch isn't
> following the CREATE DATABASE precedent, it's the patch that's wrong,
> not CREATE DATABASE.

What about SET client_encoding TO encoding?

>> Changes in pg_wchar.h are the most debatable parts of the patch.
>> The patch adds pg_cached_encoding_conversion(). We normally use
>> pg_do_encoding_conversion(), but it is slower than the proposed
>> function because it lookups conversion proc from catalog every call.
>
> Why should callers need to be changed for that?  Just make the existing
> function use caching.

Because the desired behavior is similar to
pg_client_to_server/server_to_client, which cache the conversion
functions for the client encoding and database encoding in file-local
variables. We have no such place to cache functions for other
encoding conversions, so I decided to cache them in CopyState and pass
them to the mb routine.

Regards,


--
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Tom Lane

Date:

February 4, 2011, 12:30:40

Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>> The reason that we use quotes in CREATE DATABASE is that encoding names
>> aren't assumed to be valid SQL identifiers.  If this patch isn't
>> following the CREATE DATABASE precedent, it's the patch that's wrong,
>> not CREATE DATABASE.

> What about SET client_encoding TO encoding?

SET is in its own little world --- it will interchangeably take names
with or without quotes.  It is not a precedent to follow elsewhere.
        regards, tom lane

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 4, 2011, 12:41:54

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>>> The reason that we use quotes in CREATE DATABASE is that encoding names
>>> aren't assumed to be valid SQL identifiers.  If this patch isn't
>>> following the CREATE DATABASE precedent, it's the patch that's wrong,
>>> not CREATE DATABASE.
>
>> What about SET client_encoding TO encoding?
>
> SET is in its own little world --- it will interchangeably take names
> with or without quotes.  It is not a precedent to follow elsewhere.

I see. I'll update my patch, after the mb change discussion gets done.

>> * Can we use #ifndef FRONTEND in the header?
>> Usage of fmgr.h members will broke client applications without the #ifdef,
>> but I guess client apps don't always have definitions of FRONTEND.
>> If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
>> a better header for the new API because FindDefaultConversion() is in it.
>
> Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it's all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?


Regards,

--
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Tom Lane

Date:

February 4, 2011, 12:52:14

Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>> Yeah, putting backend-only stuff into that header is a nonstarter.

> Do you mean you think it's all right to define
> pg_cached_encoding_conversion() in pg_conversion_fn.h?

That seems pretty random too.  I still think you've designed this API
badly and it'd be better to avoid exposing the FmgrInfo to callers
by letting the function manage the cache internally.
        regards, tom lane
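
One way to read the suggestion above is a conversion function that remembers the last conversion it resolved, so callers pass only encodings and never see the cached handle. The sketch below is a toy illustration of that idea, not PostgreSQL code: toy_convert(), conv_fn, lookup_conversion() and the integer encoding ids are all hypothetical, and the single-entry cache is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter signature; not PostgreSQL's real one. */
typedef size_t (*conv_fn)(const char *src, char *dst, size_t dstlen);

/* Counts how often the (expensive) lookup runs, for demonstration only. */
static int lookup_calls = 0;

/* Stand-in converter: copies the bytes unchanged. */
static size_t
identity_conv(const char *src, char *dst, size_t dstlen)
{
    size_t n = strlen(src);

    assert(dstlen > n);         /* room for the terminating NUL */
    memcpy(dst, src, n + 1);
    return n;
}

/* Stand-in for the catalog lookup, which is the costly step. */
static conv_fn
lookup_conversion(int src_enc, int dst_enc)
{
    (void) src_enc;
    (void) dst_enc;
    lookup_calls++;
    return identity_conv;
}

/*
 * Callers name only the encodings. The resolved converter is cached in
 * function-local statics, so no cache handle appears in the API.
 */
static size_t
toy_convert(const char *src, char *dst, size_t dstlen,
            int src_enc, int dst_enc)
{
    static int cached_src = -1;
    static int cached_dst = -1;
    static conv_fn cached_fn = NULL;

    if (cached_fn == NULL || cached_src != src_enc || cached_dst != dst_enc)
    {
        cached_fn = lookup_conversion(src_enc, dst_enc);
        cached_src = src_enc;
        cached_dst = dst_enc;
    }
    return cached_fn(src, dst, dstlen);
}
```

The trade-off of a single-entry cache is that callers alternating between two encoding pairs would trigger a lookup on every call; a real design would also need to decide when a cached entry becomes stale.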

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 4, 2011, 14:54:51

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>>> Yeah, putting backend-only stuff into that header is a nonstarter.
>
>> Do you mean you think it's all right to define
>> pg_cached_encoding_conversion() in pg_conversion_fn.h?
>
> That seems pretty random too.  I still think you've designed this API
> badly and it'd be better to avoid exposing the FmgrInfo to callers
> by letting the function manage the cache internally.

I'll try it in a few days, but so far the only idea I have come up with
is a struct holding the FmgrInfo that is private to one file, like
tuplestorestate....

Regards,

--
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Robert Haas

Date:

February 8, 2011, 00:01:26

On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>>>> Yeah, putting backend-only stuff into that header is a nonstarter.
>>
>>> Do you mean you think it's all right to define
>>> pg_cached_encoding_conversion() in pg_conversion_fn.h?
>>
>> That seems pretty random too.  I still think you've designed this API
>> badly and it'd be better to avoid exposing the FmgrInfo to callers
>> by letting the function manage the cache internally.
>
> I'll try it in a few days, but only making struct that holds FmgrInfo
> in a file local like tuplestorestate comes up with so far....

We've been through several iterations of this patch now and haven't
come up with something committable.  I think it's time to mark this
one Returned with Feedback and revisit this for 9.2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 8, 2011, 02:02:37

2011/2/8 Robert Haas <robertmhaas@gmail.com>:
> On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>>> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>>>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:
>>>>> Yeah, putting backend-only stuff into that header is a nonstarter.
>>>
>>>> Do you mean you think it's all right to define
>>>> pg_cached_encoding_conversion() in pg_conversion_fn.h?
>>>
>>> That seems pretty random too.  I still think you've designed this API
>>> badly and it'd be better to avoid exposing the FmgrInfo to callers
>>> by letting the function manage the cache internally.
>>
>> I'll try it in a few days, but only making struct that holds FmgrInfo
>> in a file local like tuplestorestate comes up with so far....
>
> We've been through several iterations of this patch now and haven't
> come up with something committable.  I think it's time to mark this
> one Returned with Feedback and revisit this for 9.2.

That is what I have been thinking lately, too. The design still isn't settled.
Thanks for all the reviews.

Regards,


--
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Peter Eisentraut

Date:

February 8, 2011, 11:54:47

On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:
> The reason that we use quotes in CREATE DATABASE is that encoding
> names aren't assumed to be valid SQL identifiers.  If this patch isn't
> following the CREATE DATABASE precedent, it's the patch that's wrong,
> not CREATE DATABASE.

Since encoding names are built-in and therefore well known, and the
names have been aligned with the SQL standard names, which are
identifiers, I don't think this argument is valid (anymore).

It probably shouldn't be changed inconsistently as part of an unrelated
patch, but I think the idea has merit.

Re: Add ENCODING option to COPY

From

Tom Lane

Date:

February 8, 2011, 11:57:06

Peter Eisentraut <peter_e@gmx.net> writes:
> On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:
>> The reason that we use quotes in CREATE DATABASE is that encoding
>> names aren't assumed to be valid SQL identifiers.  If this patch isn't
>> following the CREATE DATABASE precedent, it's the patch that's wrong,
>> not CREATE DATABASE.

> Since encoding names are built-in and therefore well known, and the
> names have been aligned with the SQL standard names, which are
> identifiers, I don't think this argument is valid (anymore).

What about "UTF-8"?
        regards, tom lane

Re: Add ENCODING option to COPY

From

Hitoshi Harada

Date:

February 8, 2011, 12:39:16

2011/2/9 Tom Lane <tgl@sss.pgh.pa.us>:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:
>>> The reason that we use quotes in CREATE DATABASE is that encoding
>>> names aren't assumed to be valid SQL identifiers.  If this patch isn't
>>> following the CREATE DATABASE precedent, it's the patch that's wrong,
>>> not CREATE DATABASE.
>
>> Since encoding names are built-in and therefore well known, and the
>> names have been aligned with the SQL standard names, which are
>> identifiers, I don't think this argument is valid (anymore).
>
> What about "UTF-8"?

Then, quote it?

db1=# set client_encoding to utf-8;
ERROR:  syntax error at or near "-"

Regards,


--
Hitoshi Harada

Re: Add ENCODING option to COPY

From

Peter Eisentraut

Date:

February 8, 2011, 14:47:59

On tis, 2011-02-08 at 10:53 -0500, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:
> >> The reason that we use quotes in CREATE DATABASE is that encoding
> >> names aren't assumed to be valid SQL identifiers.  If this patch isn't
> >> following the CREATE DATABASE precedent, it's the patch that's wrong,
> >> not CREATE DATABASE.
> 
> > Since encoding names are built-in and therefore well known, and the
> > names have been aligned with the SQL standard names, which are
> > identifiers, I don't think this argument is valid (anymore).
> 
> What about "UTF-8"?

The canonical name of that is UTF8.  But you can quote it if you want to
spell it differently.
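
The quoting question in this exchange comes down to whether an encoding name can be written as a bare SQL identifier. The sketch below checks that for ASCII names only, assuming the usual rule for unquoted PostgreSQL identifiers: a letter or underscore first, then letters, digits, underscores or dollar signs. The function name is_plain_identifier() is made up for this illustration, and the check deliberately ignores non-ASCII letters and reserved keywords.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only helpers, to avoid locale-dependent <ctype.h> behavior. */
static bool
is_ascii_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool
is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/*
 * Returns true if name could be written without quotes as an identifier
 * under the simplified rule described above.
 */
static bool
is_plain_identifier(const char *name)
{
    assert(name != NULL);

    /* First character: letter or underscore (rejects the empty string). */
    if (!(is_ascii_letter(name[0]) || name[0] == '_'))
        return false;

    /* Remaining characters: letter, digit, underscore or dollar sign. */
    for (size_t i = 1; name[i] != '\0'; i++)
    {
        char c = name[i];

        if (!(is_ascii_letter(c) || is_ascii_digit(c) || c == '_' || c == '$'))
            return false;
    }
    return true;
}
```

Under this rule UTF8 and EUC_JP pass, while UTF-8 fails because of the hyphen, which matches the syntax error shown earlier in the thread.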
