Discussion: Add ENCODING option to COPY
Here's the patch to add ENCODING option to COPY command. The fundamental issue was explained months ago: http://archives.postgresql.org/message-id/AANLkTikCt6bHXZjO_oX+JS7+G=jAQ7gVZPu0Owjcsbfb@mail.gmail.com In short, client_encoding is not appropriate for copy operation so we should need the specialized option for it. Robert Haas agreed with its need later in the thread. Thanks. The patch overrides client_encoding by the added ENCODING option, and restores it as soon as copy is done. I see some complaints ask to use pg_do_encoding_conversion() instead of pg_client_to_server/server_to_client(), but the former will surely add slight overhead per reading line and I believe copy is performance-sensitive command. I'll add this to the CF app. Regards, -- Hitoshi Harada
Attachments
On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote: > The patch overrides client_encoding by the added ENCODING option, and > restores it as soon as copy is done. We cannot do that because error messages should be encoded in the original encoding even during COPY commands with encoding option. Error messages could contain non-ASCII characters if lc_messages is set. > I see some complaints ask to use > pg_do_encoding_conversion() instead of > pg_client_to_server/server_to_client(), but the former will surely add > slight overhead per reading line If we want to reduce the overhead, we should cache the conversion procedure in CopyState. How about adding something like "FmgrInfo file_to_server_covv" into it? -- Itagaki Takahiro
2011/1/25 Itagaki Takahiro <itagaki.takahiro@gmail.com>:
> On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> The patch overrides client_encoding by the added ENCODING option, and
>> restores it as soon as copy is done.
>
> We cannot do that because error messages should be encoded in the original
> encoding even during COPY commands with encoding option. Error messages
> could contain non-ASCII characters if lc_messages is set.

Agreed.

>> I see some complaints ask to use
>> pg_do_encoding_conversion() instead of
>> pg_client_to_server/server_to_client(), but the former will surely add
>> slight overhead per reading line
>
> If we want to reduce the overhead, we should cache the conversion procedure
> in CopyState. How about adding something like "FmgrInfo file_to_server_covv"
> into it?

I looked down to the code and found that we cannot pass FmgrInfo * to any functions defined in pg_wchar.h, since the header file is shared in libpq, too.

For the record, I also tried pg_do_encoding_conversion() instead of pg_client_to_server/server_to_client(), and the simple benchmark shows it is too slow.

with 3000000 lines with 3 columns (~22MB tsv)

COPY FROM
* utf8 -> utf8 (no conversion)
  13428.233ms
  13322.832ms
  15661.093ms
* euc_jp -> utf8 (client_encoding)
  17527.470ms
  16457.452ms
  16522.337ms
* euc_jp -> utf8 (pg_do_encoding_conversion)
  20550.983ms
  21425.313ms
  20774.323ms

I'll check the code more if we have better alternatives.

Regards,
-- Hitoshi Harada
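Editorial note: the three runs per case quoted above can be reduced to a mean and a relative slowdown. The following is a minimal C sketch of that arithmetic only; the helper names mean_ms and overhead_percent are hypothetical and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples, in milliseconds.
 * Hypothetical helper for illustration only. */
double mean_ms(const double *samples, size_t n)
{
    double sum = 0.0;

    assert(samples != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double) n;
}

/* Slowdown of `candidate` relative to `baseline`, as a percentage.
 * Hypothetical helper for illustration only. */
double overhead_percent(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

Fed with the figures from the message, the pg_do_encoding_conversion() runs average about 20917 ms against about 16836 ms for the client_encoding runs, which is a slowdown of roughly 24% on this particular data set.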
On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: > I'll check the code more if we have better alternatives. Where are we with this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/1/31 Robert Haas <robertmhaas@gmail.com>: > On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >> I'll check the code more if we have better alternatives. > > Where are we with this? I'll post another version today. Regards, -- Hitoshi Harada
2011/1/31 Hitoshi Harada <umi.tanuki@gmail.com>: > 2011/1/31 Robert Haas <robertmhaas@gmail.com>: >> On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >>> I'll check the code more if we have better alternatives. >> >> Where are we with this? > > I'll post another version today. Here's the patch. Finally I concluded the concern Itagaki-san raised can be solved by adding code that restores client_encoding in copy_in_error_callback. I tested some encoding mismatch cases with this patch and saw appropriate messages in NLS environment. Regards, -- Hitoshi Harada
Attachments
Hitoshi Harada <umi.tanuki@gmail.com> writes: > Finally I concluded the concern Itagaki-san raised can be solved by > adding code that restores client_encoding in copy_in_error_callback. That seems like an absolutely horrid idea. Error context callbacks should not have side-effects like that. They're not guaranteed to be called at all, let alone in any particular order. In this case I'd also be worried that the state needs to be fixed before elog.c reaches the point of calling the callbacks --- there's nothing to say that it might not try to translate some strings to the client encoding earlier than that. It might happen to work today (or at least in the scenarios you tested), but it seems fragile as can be. regards, tom lane
2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> Finally I concluded the concern Itagaki-san raised can be solved by >> adding code that restores client_encoding in copy_in_error_callback. > > It might happen to work today (or at least in the scenarios you tested), > but it seems fragile as can be. Although I thought its fragility was acceptable to avoid making the patch too complex, I agree with you. The third patch is attached; it modifies the mb routines so that they can receive conversion procedures as FmgrInfo * and saves the function pointer in CopyState. I tested it with the ENCODING option and could not see any performance slowdown. Regards, -- Hitoshi Harada
Attachments
2011/2/1 Hitoshi Harada <umi.tanuki@gmail.com>: > 2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>: >> Hitoshi Harada <umi.tanuki@gmail.com> writes: >>> Finally I concluded the concern Itagaki-san raised can be solved by >>> adding code that restores client_encoding in copy_in_error_callback. >> >> It might happen to work today (or at least in the scenarios you tested), >> but it seems fragile as can be. > > Although I thought its fragile-ness was acceptable to avoid making the > patch too complex, I agree with you. > The third patch is attached, modifying mb routines so that they can > receive conversion procedures as FmgrInof * and save the function > pointer in CopyState. > I tested it with encoding option and could not see performance slowdown. > Hmm, sorry, the patch was wrong. Correct version is attached. Regards, -- Hitoshi Harada
Attachments
On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >> The third patch is attached, modifying mb routines so that they can >> receive conversion procedures as FmgrInof * and save the function >> pointer in CopyState. >> I tested it with encoding option and could not see performance slowdown. > Hmm, sorry, the patch was wrong. Correct version is attached. Here is a brief review for the patch. * Syntax: ENCODING encoding vs. ENCODING 'encoding' We don't have to quote encoding names in the patch, but we always need quotes for CREATE DATABASE WITH ENCODING. I think we should modify CREATE DATABASE to accept unquoted encoding names aside from the patch. Changes in pg_wchar.h are the most debatable parts of the patch. The patch adds pg_cached_encoding_conversion(). We normally use pg_do_encoding_conversion(), but it is slower than the proposed function because it lookups conversion proc from catalog every call. * Can we use #ifndef FRONTEND in the header? Usage of fmgr.h members will broke client applications without the #ifdef, but I guess client apps don't always have definitions of FRONTEND. If we don't want to change pg_wchar.h, pg_conversion_fn.h might be a better header for the new API because FindDefaultConversion() is in it. Changes in copy.c looks good except a few trivial cosmetic issues: * encoding_option could be a local variable instead of cstate's member. * cstate->client_encoding is renamed to target_encoding, but I prefer file_encoding or remote_encoding. * CopyState can have conv_proc entity as a member instead of the pointer. * need_transcoding checks could be replaced with conv_proc IS NULL check. -- Itagaki Takahiro
2011/2/4 Itagaki Takahiro <itagaki.takahiro@gmail.com>: > On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >>> The third patch is attached, modifying mb routines so that they can >>> receive conversion procedures as FmgrInof * and save the function >>> pointer in CopyState. >>> I tested it with encoding option and could not see performance slowdown. >> Hmm, sorry, the patch was wrong. Correct version is attached. > > Here is a brief review for the patch. Thanks for the review. > * Syntax: ENCODING encoding vs. ENCODING 'encoding' > We don't have to quote encoding names in the patch, but we always need > quotes for CREATE DATABASE WITH ENCODING. I think we should modify > CREATE DATABASE to accept unquoted encoding names aside from the patch. I followed the syntax in SET client_encoding TO xxx, where you don't need quote xxx. I didn't care CREATE DATABASE. > Changes in pg_wchar.h are the most debatable parts of the patch. > The patch adds pg_cached_encoding_conversion(). We normally use > pg_do_encoding_conversion(), but it is slower than the proposed > function because it lookups conversion proc from catalog every call. > > * Can we use #ifndef FRONTEND in the header? > Usage of fmgr.h members will broke client applications without the #ifdef, > but I guess client apps don't always have definitions of FRONTEND. > If we don't want to change pg_wchar.h, pg_conversion_fn.h might be > a better header for the new API because FindDefaultConversion() is in it. It doesn't look to me like clear solution. FindDefaultConversion() is implemented in pg_conversion.c, whereas pg_cached_encoding_conversion() in mbutils.c. Maybe adding another header, namely pg_wchar_backend.h? > * CopyState can have conv_proc entity as a member instead of the pointer. > * need_transcoding checks could be replaced with conv_proc IS NULL check. No, need_transcoding means you need verification even if the target and server encoding is the same. See comments in CopyTo(). 
If pg_wchar_backend.h is acceptable, I'll post the revised patch. Regards, -- Hitoshi Harada
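Editorial note: the distinction drawn in the message above, between "a conversion procedure is needed" and "need_transcoding", can be sketched in C as follows. This is an illustration under an assumption: it supposes the flag is set when the encodings differ or when the server encoding is multibyte, so that input still gets verified. The exact condition in copy.c is not quoted in the thread, and both function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed rule: the data must pass through the conversion/verification
 * path if the encodings differ, or if the server encoding is multibyte
 * (so that invalid byte sequences are caught even without conversion). */
bool need_transcoding(int file_encoding, int server_encoding,
                      int server_max_char_len)
{
    assert(server_max_char_len >= 1);
    return file_encoding != server_encoding || server_max_char_len > 1;
}

/* A conversion procedure is only needed when the encodings differ. */
bool need_conversion_proc(int file_encoding, int server_encoding)
{
    return file_encoding != server_encoding;
}
```

Under this assumption, a file and a server that share one multibyte encoding need no conversion procedure but still need the verification pass, which is why a NULL check on the procedure would not be an equivalent test.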
Itagaki Takahiro <itagaki.takahiro@gmail.com> writes: > * Syntax: ENCODING encoding vs. ENCODING 'encoding' > We don't have to quote encoding names in the patch, but we always need > quotes for CREATE DATABASE WITH ENCODING. I think we should modify > CREATE DATABASE to accept unquoted encoding names aside from the patch. The reason that we use quotes in CREATE DATABASE is that encoding names aren't assumed to be valid SQL identifiers. If this patch isn't following the CREATE DATABASE precedent, it's the patch that's wrong, not CREATE DATABASE. > Changes in pg_wchar.h are the most debatable parts of the patch. > The patch adds pg_cached_encoding_conversion(). We normally use > pg_do_encoding_conversion(), but it is slower than the proposed > function because it lookups conversion proc from catalog every call. Why should callers need to be changed for that? Just make the existing function use caching. > * Can we use #ifndef FRONTEND in the header? > Usage of fmgr.h members will broke client applications without the #ifdef, > but I guess client apps don't always have definitions of FRONTEND. > If we don't want to change pg_wchar.h, pg_conversion_fn.h might be > a better header for the new API because FindDefaultConversion() is in it. Yeah, putting backend-only stuff into that header is a nonstarter. regards, tom lane
2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: > Itagaki Takahiro <itagaki.takahiro@gmail.com> writes: >> * Syntax: ENCODING encoding vs. ENCODING 'encoding' >> We don't have to quote encoding names in the patch, but we always need >> quotes for CREATE DATABASE WITH ENCODING. I think we should modify >> CREATE DATABASE to accept unquoted encoding names aside from the patch. > > The reason that we use quotes in CREATE DATABASE is that encoding names > aren't assumed to be valid SQL identifiers. If this patch isn't > following the CREATE DATABASE precedent, it's the patch that's wrong, > not CREATE DATABASE. What about SET client_encoding TO encoding? >> Changes in pg_wchar.h are the most debatable parts of the patch. >> The patch adds pg_cached_encoding_conversion(). We normally use >> pg_do_encoding_conversion(), but it is slower than the proposed >> function because it lookups conversion proc from catalog every call. > > Why should callers need to be changed for that? Just make the existing > function use caching. Because the demanded behavior is similar to pg_client_to_server/server_to_client, which caches functions as specialized client encoding and database encoding as the file local variables. We didn't have such spaces to cache functions for other encoding conversions, so decided to cache them in CopyState and pass them to the mb routine. Regards, -- Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes: > 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >> The reason that we use quotes in CREATE DATABASE is that encoding names >> aren't assumed to be valid SQL identifiers. If this patch isn't >> following the CREATE DATABASE precedent, it's the patch that's wrong, >> not CREATE DATABASE. > What about SET client_encoding TO encoding? SET is in its own little world --- it will interchangeably take names with or without quotes. It is not a precedent to follow elsewhere. regards, tom lane
2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >>> The reason that we use quotes in CREATE DATABASE is that encoding names >>> aren't assumed to be valid SQL identifiers. If this patch isn't >>> following the CREATE DATABASE precedent, it's the patch that's wrong, >>> not CREATE DATABASE. > >> What about SET client_encoding TO encoding? > > SET is in its own little world --- it will interchangeably take names > with or without quotes. It is not a precedent to follow elsewhere. I see. I'll update my patch after the discussion of the mb changes is settled. >> * Can we use #ifndef FRONTEND in the header? >> Usage of fmgr.h members will broke client applications without the #ifdef, >> but I guess client apps don't always have definitions of FRONTEND. >> If we don't want to change pg_wchar.h, pg_conversion_fn.h might be >> a better header for the new API because FindDefaultConversion() is in it. > > Yeah, putting backend-only stuff into that header is a nonstarter. Do you mean you think it's all right to define pg_cached_encoding_conversion() in pg_conversion_fn.h? Regards, -- Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes: > 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >> Yeah, putting backend-only stuff into that header is a nonstarter. > Do you mean you think it' all right to define > pg_cached_encoding_conversion() in pg_conversion_fn.h? That seems pretty random too. I still think you've designed this API badly and it'd be better to avoid exposing the FmgrInfo to callers by letting the function manage the cache internally. regards, tom lane
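Editorial note: the design suggested above, where the conversion function manages its cache internally and callers never see an FmgrInfo, can be sketched in plain C as follows. This is an illustration with mock types, not PostgreSQL code: conv_fn, lookup_conversion, convert_cached and get_lookup_count are hypothetical names, the "conversion" merely copies bytes, and a real backend implementation would also have to deal with cache invalidation and memory contexts, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Mock conversion procedure type (stand-in for a looked-up function). */
typedef void (*conv_fn)(const char *src, char *dst, size_t len);

static int lookup_count = 0;    /* counts simulated catalog lookups */

/* Mock conversion: copies len bytes and NUL-terminates the result. */
static void copy_bytes(const char *src, char *dst, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    dst[len] = '\0';
}

/* Hypothetical stand-in for an expensive per-call catalog lookup. */
static conv_fn lookup_conversion(int src_enc, int dst_enc)
{
    (void) src_enc;
    (void) dst_enc;
    lookup_count++;
    return copy_bytes;
}

/* Converts src into dst (dst must hold len + 1 bytes). The last
 * (src_enc, dst_enc) pair is remembered in file-local state, so the
 * lookup happens again only when the pair changes. */
void convert_cached(int src_enc, int dst_enc,
                    const char *src, char *dst, size_t len)
{
    static int cached_src = -1;
    static int cached_dst = -1;
    static conv_fn cached_fn = NULL;

    assert(src != NULL && dst != NULL);
    if (cached_fn == NULL || cached_src != src_enc || cached_dst != dst_enc)
    {
        cached_fn = lookup_conversion(src_enc, dst_enc);
        cached_src = src_enc;
        cached_dst = dst_enc;
    }
    cached_fn(src, dst, len);
}

/* Exposes the lookup counter so the caching effect can be observed. */
int get_lookup_count(void)
{
    return lookup_count;
}
```

Calling convert_cached() repeatedly with the same encoding pair performs the simulated lookup once; switching to a different pair triggers one more. A single-entry cache is the simplest possible choice and is only meant to show the shape of the API, in which the caller's signature stays free of backend-only types.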
2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >>> Yeah, putting backend-only stuff into that header is a nonstarter. > >> Do you mean you think it' all right to define >> pg_cached_encoding_conversion() in pg_conversion_fn.h? > > That seems pretty random too. I still think you've designed this API > badly and it'd be better to avoid exposing the FmgrInfo to callers > by letting the function manage the cache internally. I'll try it in a few days, but so far the only idea I have come up with is a file-local struct that holds the FmgrInfo, like tuplestorestate.... Regards, -- Hitoshi Harada
On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: > 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >> Hitoshi Harada <umi.tanuki@gmail.com> writes: >>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >>>> Yeah, putting backend-only stuff into that header is a nonstarter. >> >>> Do you mean you think it' all right to define >>> pg_cached_encoding_conversion() in pg_conversion_fn.h? >> >> That seems pretty random too. I still think you've designed this API >> badly and it'd be better to avoid exposing the FmgrInfo to callers >> by letting the function manage the cache internally. > > I'll try it in a few days, but only making struct that holds FmgrInfo > in a file local like tuplestorestate comes up with so far.... We've been through several iterations of this patch now and haven't come up with something committable. I think it's time to mark this one Returned with Feedback and revisit this for 9.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/2/8 Robert Haas <robertmhaas@gmail.com>: > On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: >> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >>> Hitoshi Harada <umi.tanuki@gmail.com> writes: >>>> 2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>: >>>>> Yeah, putting backend-only stuff into that header is a nonstarter. >>> >>>> Do you mean you think it' all right to define >>>> pg_cached_encoding_conversion() in pg_conversion_fn.h? >>> >>> That seems pretty random too. I still think you've designed this API >>> badly and it'd be better to avoid exposing the FmgrInfo to callers >>> by letting the function manage the cache internally. >> >> I'll try it in a few days, but only making struct that holds FmgrInfo >> in a file local like tuplestorestate comes up with so far.... > > We've been through several iterations of this patch now and haven't > come up with something committable. I think it's time to mark this > one Returned with Feedback and revisit this for 9.2. As I've been thinking these days. The design isn't fixed yet even now. Thanks for all the reviews. Regards, -- Hitoshi Harada
On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote: > The reason that we use quotes in CREATE DATABASE is that encoding > names aren't assumed to be valid SQL identifiers. If this patch isn't > following the CREATE DATABASE precedent, it's the patch that's wrong, > not CREATE DATABASE. Since encoding names are built-in and therefore well known, and the names have been aligned with the SQL standard names, which are identifiers, I don't think this argument is valid (anymore). It probably shouldn't be changed inconsistently as part of an unrelated patch, but I think the idea has merit.
Peter Eisentraut <peter_e@gmx.net> writes: > On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote: >> The reason that we use quotes in CREATE DATABASE is that encoding >> names aren't assumed to be valid SQL identifiers. If this patch isn't >> following the CREATE DATABASE precedent, it's the patch that's wrong, >> not CREATE DATABASE. > Since encoding names are built-in and therefore well known, and the > names have been aligned with the SQL standard names, which are > identifiers, I don't think this argument is valid (anymore). What about "UTF-8"? regards, tom lane
2011/2/9 Tom Lane <tgl@sss.pgh.pa.us>: > Peter Eisentraut <peter_e@gmx.net> writes: >> On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote: >>> The reason that we use quotes in CREATE DATABASE is that encoding >>> names aren't assumed to be valid SQL identifiers. If this patch isn't >>> following the CREATE DATABASE precedent, it's the patch that's wrong, >>> not CREATE DATABASE. > >> Since encoding names are built-in and therefore well known, and the >> names have been aligned with the SQL standard names, which are >> identifiers, I don't think this argument is valid (anymore). > > What about "UTF-8"? Then, quote it? db1=# set client_encoding to utf-8; ERROR: syntax error at or near "-" Regards, -- Hitoshi Harada
On tis, 2011-02-08 at 10:53 -0500, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote: > >> The reason that we use quotes in CREATE DATABASE is that encoding > >> names aren't assumed to be valid SQL identifiers. If this patch isn't > >> following the CREATE DATABASE precedent, it's the patch that's wrong, > >> not CREATE DATABASE. > > > Since encoding names are built-in and therefore well known, and the > > names have been aligned with the SQL standard names, which are > > identifiers, I don't think this argument is valid (anymore). > > What about "UTF-8"? The canonical name of that is UTF8. But you can quote it if you want to spell it differently.