Обсуждение: Add new COPY option REJECT_LIMIT
Hi, 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the maximum tolerable number of soft errors. I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to how dirty it can be. Attached a patch for this. What do you think? [1] https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Вложения
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.
I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached a patch for this.
What do you think?
I'm opposed to adding this particular feature.
When implementing this kind of business rule I'd need the option to specify a percentage, not just an absolute value.
I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it. The "ERRORS TO" and maybe a corresponding "STATS TO" option where a table can be specified for the system to place the problematic data and stats about the copy itself.
David J.
On 2024-01-27 00:20, David G. Johnston wrote: Thanks for your comments! > On Fri, Jan 26, 2024 at 2:49 AM torikoshia > <torikoshia@oss.nttdata.com> wrote: > >> Hi, >> >> 9e2d870 enabled the COPY command to skip soft error, and I think we >> can >> add another option which specifies the maximum tolerable number of >> soft >> errors. >> >> I remember this was discussed in [1], and feel it would be useful >> when >> loading 'dirty' data but there is a limit to how dirty it can be. >> >> Attached a patch for this. >> >> What do you think? > > I'm opposed to adding this particular feature. > > When implementing this kind of business rule I'd need the option to > specify a percentage, not just an absolute value. Yeah, it seems useful for some cases. Actually, Greenplum enables to specify not only the max number of bad rows but also its percentage[1]. I may be wrong, but considering some dataloaders support something like reject_limit(Redshift supports MAXERROR[2], pg_bulkload supports PARSE_ERRORS[3]), specifying the "number" of the bad row might also be useful. I think we can implement reject_limit specified by percentage simply calculating the ratio of skipped and processed at the end of CopyFrom() like this: if (cstate->opts.reject_limit > 0 && (double) skipped / (processed + skipped) > cstate->opts.reject_limit_percent) ereport(ERROR, (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), errmsg("exceeded the ratio specified by .. > I would focus on trying to put the data required to make this kind of > determination into a place where applications implementing such > business rules and monitoring can readily get at it. The "ERRORS TO" > and maybe a corresponding "STATS TO" option where a table can be > specified for the system to place the problematic data and stats about > the copy itself. It'd be nice to have such informative tables, but I believe the benefit of reject_limit is it fails the entire loading when the threshold is exceeded. I imagine when we just have error and stats information tables for COPY, users have to delete the rows when they confirmed too many errors in these tables. [1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html [2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html [3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024/01/26 18:49, torikoshia wrote: > Hi, > > 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the maximumtolerable number of soft errors. > > I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to howdirty it can be. > > Attached a patch for this. > > What do you think? The patch no longer applies cleanly to HEAD. Could you update it? I think the REJECT_LIMIT feature is useful. Allowing it to be set as either the absolute number of skipped rows or a percentageof the total input rows is a good idea. However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. REJECT_LIMIT seems to coverthe same cases. For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and REJECT_LIMIT=0 can act like ON_ERROR=stop. Therefore, having both ON_ERROR and REJECT_LIMIT might be confusing. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-07-03 02:07, Fujii Masao wrote: Thanks for your comments! > On 2024/01/26 18:49, torikoshia wrote: >> Hi, >> >> 9e2d870 enabled the COPY command to skip soft error, and I think we >> can add another option which specifies the maximum tolerable number of >> soft errors. >> >> I remember this was discussed in [1], and feel it would be useful when >> loading 'dirty' data but there is a limit to how dirty it can be. >> >> Attached a patch for this. >> >> What do you think? > > The patch no longer applies cleanly to HEAD. Could you update it? I'm going to update it after discussing the option format as described below. > > I think the REJECT_LIMIT feature is useful. Allowing it to be set as > either the absolute number of skipped rows or a percentage of the > total input rows is a good idea. > > However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR > option is still necessary. REJECT_LIMIT seems to cover the same cases. > For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and > REJECT_LIMIT=0 can act like ON_ERROR=stop. I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop. However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I think this might not so intuitive. Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY might not be surprising: - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] - Redshift has MAXERROR and IGNOREALLERRORS options[2] BTW after seeing Snowflake makes SKIP_FILE_num one of the options of ON_ERROR, I'm a bit wondering whether REJECT_LIMIT also should be the same. [1] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions [2] https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024/07/04 12:05, torikoshia wrote: > I'm going to update it after discussing the option format as described below. Thanks! > I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. > I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop. > However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'", doesn'tit? If so, I think this might not so intuitive. How about allowing REJECT_LIMIT to accept the keywords "infinity", "unlimited", or "all" in addition to a number? This way, users can specify one of these keywords instead of -1 to ignore all errors. The server code would then internally set the REJECT_LIMIT to -1 or another appropriate value when these keywords are used, but users wouldn't need to worry about this detail. If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS might be more intuitive and easier to understand than REJECT_LIMIT. > Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of themin PostgreSQL COPY might not be surprising: > - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] > - Redshift has MAXERROR and IGNOREALLERRORS options[2] Ok, so here's a summary of the options and their behaviors: To ignore all errors and continue to the end: - Snowflake: ON_ERROR=CONTINUE - Redshift: IGNOREALLERRORS - Postgres (with your patch): ON_ERROR=ignore - Postgres (with my idea): IGNORE_ERRORS=all To fail as soon as an error is found: - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE - Redshift: MAXERROR=0 (default) - Postgres (with your patch): ON_ERROR=stop (default) - Postgres (with my idea): IGNORE_ERRORS=0 (default) To fail when NNN or more errors are found: - Snowflake: ON_ERROR=SKIP_FILE_NNN - Redshift: MAXERROR=NNN - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore - Postgres (with my idea): IGNORE_ERRORS=NNN This makes me think it might be better to treat REJECT_LIMIT as an additional option for ON_ERROR=stop instead of ON_ERROR=ignore if we adopt your patch. Since ON_ERROR=stop is the default, users could set the maximum number of allowed errors by specifying only REJECT_LIMIT. Otherwise, they would need to specify both ON_ERROR=ignore and REJECT_LIMIT. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-07-05 12:59, Fujii Masao wrote: > On 2024/07/04 12:05, torikoshia wrote: >> I'm going to update it after discussing the option format as described >> below. > > Thanks! > >> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. >> I also think it's easy to understand that REJECT_LIMIT=0 is >> ON_ERROR=stop. >> However, expressing REJECT_LIMIT='infinity' needs some definition like >> "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I >> think this might not so intuitive. > > How about allowing REJECT_LIMIT to accept the keywords "infinity", > "unlimited", > or "all" in addition to a number? This way, users can specify one of > these > keywords instead of -1 to ignore all errors. The server code would then > internally set the REJECT_LIMIT to -1 or another appropriate value when > these keywords are used, but users wouldn't need to worry about this > detail. Agreed. > If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS > might be more intuitive and easier to understand than REJECT_LIMIT. I feel that 'infinite' and 'unlimited' are unfamiliar values for PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would be a better parameter name as your suggestion. >> Also, since it seems Snowflake and Redshift have both options >> equivalent to REJECT_LIMIT and ON_ERROR, having both of them in >> PostgreSQL COPY might not be surprising: >> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | >> 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] >> - Redshift has MAXERROR and IGNOREALLERRORS options[2] > > Ok, so here's a summary of the options and their behaviors: > > To ignore all errors and continue to the end: > > - Snowflake: ON_ERROR=CONTINUE > - Redshift: IGNOREALLERRORS > - Postgres (with your patch): ON_ERROR=ignore > - Postgres (with my idea): IGNORE_ERRORS=all > > To fail as soon as an error is found: > > - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE > - Redshift: MAXERROR=0 (default) > - Postgres (with your patch): ON_ERROR=stop (default) > - Postgres (with my idea): IGNORE_ERRORS=0 (default) > > To fail when NNN or more errors are found: > > - Snowflake: ON_ERROR=SKIP_FILE_NNN > - Redshift: MAXERROR=NNN > - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore > - Postgres (with my idea): IGNORE_ERRORS=NNN Thanks for the summary. > This makes me think it might be better to treat REJECT_LIMIT as > an additional option for ON_ERROR=stop instead of ON_ERROR=ignore > if we adopt your patch. Since ON_ERROR=stop is the default, > users could set the maximum number of allowed errors by specifying > only REJECT_LIMIT. Otherwise, they would need to specify both > ON_ERROR=ignore and REJECT_LIMIT. That makes sense. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation