Обсуждение: Add new COPY option REJECT_LIMIT

Поиск
Список
Период
Сортировка

Add new COPY option REJECT_LIMIT

От
torikoshia
Дата:
Hi,

9e2d870 enabled the COPY command to skip soft error, and I think we can 
add another option which specifies the maximum tolerable number of soft 
errors.

I remember this was discussed in [1], and feel it would be useful when 
loading 'dirty' data but there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


[1] 
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Вложения

Re: Add new COPY option REJECT_LIMIT

От
"David G. Johnston"
Дата:
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,

9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.

I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


I'm opposed to adding this particular feature.

When implementing this kind of business rule I'd need the option to specify a percentage, not just an absolute value.

I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it.  The "ERRORS TO" and maybe a corresponding "STATS TO" option where a table can be specified for the system to place the problematic data and stats about the copy itself.

David J.

Re: Add new COPY option REJECT_LIMIT

От
torikoshia
Дата:
On 2024-01-27 00:20, David G. Johnston wrote:

Thanks for your comments!

> On Fri, Jan 26, 2024 at 2:49 AM torikoshia
> <torikoshia@oss.nttdata.com> wrote:
> 
>> Hi,
>> 
>> 9e2d870 enabled the COPY command to skip soft error, and I think we
>> can
>> add another option which specifies the maximum tolerable number of
>> soft
>> errors.
>> 
>> I remember this was discussed in [1], and feel it would be useful
>> when
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> 
>> Attached a patch for this.
>> 
>> What do you think?
> 
> I'm opposed to adding this particular feature.
> 
> When implementing this kind of business rule I'd need the option to
> specify a percentage, not just an absolute value.

Yeah, it seems useful for some cases.
Actually, Greenplum enables to specify not only the max number of bad 
rows but also its percentage[1].

I may be wrong, but considering some dataloaders support something like 
reject_limit(Redshift supports MAXERROR[2], pg_bulkload supports 
PARSE_ERRORS[3]), specifying the "number" of the bad row might also be 
useful.

I think we can implement reject_limit specified by percentage simply 
calculating the ratio of skipped and processed at the end of CopyFrom() 
like this:

   if (cstate->opts.reject_limit > 0 && (double) skipped / (processed + 
skipped) > cstate->opts.reject_limit_percent)
                   ereport(ERROR,
                                   
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                                    errmsg("exceeded the ratio specified 
by ..


> I would focus on trying to put the data required to make this kind of
> determination into a place where applications implementing such
> business rules and monitoring can readily get at it.  The "ERRORS TO"
> and maybe a corresponding "STATS TO" option where a table can be
> specified for the system to place the problematic data and stats about
> the copy itself.

It'd be nice to have such informative tables, but I believe the benefit 
of reject_limit is it fails the entire loading when the threshold is 
exceeded.
I imagine when we just have error and stats information tables for COPY, 
users have to delete the rows when they confirmed too many errors in 
these tables.



[1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

От
Fujii Masao
Дата:

On 2024/01/26 18:49, torikoshia wrote:
> Hi,
> 
> 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the
maximumtolerable number of soft errors.
 
> 
> I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to
howdirty it can be.
 
> 
> Attached a patch for this.
> 
> What do you think?

The patch no longer applies cleanly to HEAD. Could you update it?

I think the REJECT_LIMIT feature is useful. Allowing it to be set as either the absolute number of skipped rows or a
percentageof the total input rows is a good idea.
 

However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. REJECT_LIMIT seems to
coverthe same cases. For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and REJECT_LIMIT=0 can act like
ON_ERROR=stop.

Therefore, having both ON_ERROR and REJECT_LIMIT might be confusing.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

От
torikoshia
Дата:
On 2024-07-03 02:07, Fujii Masao wrote:
Thanks for your comments!

> On 2024/01/26 18:49, torikoshia wrote:
>> Hi,
>> 
>> 9e2d870 enabled the COPY command to skip soft error, and I think we 
>> can add another option which specifies the maximum tolerable number of 
>> soft errors.
>> 
>> I remember this was discussed in [1], and feel it would be useful when 
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> 
>> Attached a patch for this.
>> 
>> What do you think?
> 
> The patch no longer applies cleanly to HEAD. Could you update it?

I'm going to update it after discussing the option format as described 
below.

> 
> I think the REJECT_LIMIT feature is useful. Allowing it to be set as
> either the absolute number of skipped rows or a percentage of the
> total input rows is a good idea.
> 
> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
> option is still necessary. REJECT_LIMIT seems to cover the same cases.
> For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and
> REJECT_LIMIT=0 can act like ON_ERROR=stop.

I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
I also think it's easy to understand that REJECT_LIMIT=0 is 
ON_ERROR=stop.
However, expressing REJECT_LIMIT='infinity' needs some definition like 
"setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I 
think this might not so intuitive.

Also, since it seems Snowflake and Redshift have both options equivalent 
to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY 
might not be surprising:
- Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 
'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
- Redshift has MAXERROR and IGNOREALLERRORS options[2]

BTW after seeing Snowflake makes SKIP_FILE_num one of the options of 
ON_ERROR,  I'm a bit wondering whether REJECT_LIMIT also should be the 
same.


[1] 
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions
[2] 
https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

От
Fujii Masao
Дата:

On 2024/07/04 12:05, torikoshia wrote:
> I'm going to update it after discussing the option format as described below.

Thanks!

> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
> I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop.
> However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'",
doesn'tit? If so, I think this might not so intuitive.
 

How about allowing REJECT_LIMIT to accept the keywords "infinity", "unlimited",
or "all" in addition to a number? This way, users can specify one of these
keywords instead of -1 to ignore all errors. The server code would then
internally set the REJECT_LIMIT to -1 or another appropriate value when
these keywords are used, but users wouldn't need to worry about this detail.

If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.


> Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of
themin PostgreSQL COPY might not be surprising:
 
> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
> - Redshift has MAXERROR and IGNOREALLERRORS options[2]

Ok, so here's a summary of the options and their behaviors:

To ignore all errors and continue to the end:

- Snowflake: ON_ERROR=CONTINUE
- Redshift: IGNOREALLERRORS
- Postgres (with your patch): ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=all

To fail as soon as an error is found:

- Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
- Redshift: MAXERROR=0 (default)
- Postgres (with your patch): ON_ERROR=stop (default)
- Postgres (with my idea): IGNORE_ERRORS=0 (default)

To fail when NNN or more errors are found:

- Snowflake: ON_ERROR=SKIP_FILE_NNN
- Redshift: MAXERROR=NNN
- Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=NNN

This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

От
torikoshia
Дата:
On 2024-07-05 12:59, Fujii Masao wrote:
> On 2024/07/04 12:05, torikoshia wrote:
>> I'm going to update it after discussing the option format as described 
>> below.
> 
> Thanks!
> 
>> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
>> I also think it's easy to understand that REJECT_LIMIT=0 is 
>> ON_ERROR=stop.
>> However, expressing REJECT_LIMIT='infinity' needs some definition like 
>> "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I 
>> think this might not so intuitive.
> 
> How about allowing REJECT_LIMIT to accept the keywords "infinity", 
> "unlimited",
> or "all" in addition to a number? This way, users can specify one of 
> these
> keywords instead of -1 to ignore all errors. The server code would then
> internally set the REJECT_LIMIT to -1 or another appropriate value when
> these keywords are used, but users wouldn't need to worry about this 
> detail.

Agreed.

> If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
> might be more intuitive and easier to understand than REJECT_LIMIT.

I feel that 'infinite' and 'unlimited' are unfamiliar values for 
PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would 
be a better parameter name as your suggestion.

>> Also, since it seems Snowflake and Redshift have both options 
>> equivalent to REJECT_LIMIT and ON_ERROR, having both of them in 
>> PostgreSQL COPY might not be surprising:
>> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 
>> 'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
>> - Redshift has MAXERROR and IGNOREALLERRORS options[2]
> 
> Ok, so here's a summary of the options and their behaviors:
> 
> To ignore all errors and continue to the end:
> 
> - Snowflake: ON_ERROR=CONTINUE
> - Redshift: IGNOREALLERRORS
> - Postgres (with your patch): ON_ERROR=ignore
> - Postgres (with my idea): IGNORE_ERRORS=all
> 
> To fail as soon as an error is found:
> 
> - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
> - Redshift: MAXERROR=0 (default)
> - Postgres (with your patch): ON_ERROR=stop (default)
> - Postgres (with my idea): IGNORE_ERRORS=0 (default)
> 
> To fail when NNN or more errors are found:
> 
> - Snowflake: ON_ERROR=SKIP_FILE_NNN
> - Redshift: MAXERROR=NNN
> - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
> - Postgres (with my idea): IGNORE_ERRORS=NNN

Thanks for the summary.

> This makes me think it might be better to treat REJECT_LIMIT as
> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
> if we adopt your patch. Since ON_ERROR=stop is the default,
> users could set the maximum number of allowed errors by specifying
> only REJECT_LIMIT. Otherwise, they would need to specify both
> ON_ERROR=ignore and REJECT_LIMIT.

That makes sense.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation