Re: Add new COPY option REJECT_LIMIT

Поиск
Список
Период
Сортировка
От torikoshia
Тема Re: Add new COPY option REJECT_LIMIT
Дата
Msg-id 5f807fcf3a36df7ba41464ab40b5c37d@oss.nttdata.com
обсуждение исходный текст
Ответ на Re: Add new COPY option REJECT_LIMIT  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Список pgsql-hackers
On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao 
<masao.fujii@oss.nttdata.com> wrote:

Thanks for your review.

> On 2024/07/22 21:37, torikoshia wrote:
>> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> 
>> wrote:
>> Thanks for the comment.
>> 
>>> In patch 0002, the ratio is calculated by the already 
>>> skipped/processed
>>> rows, but what if a user wants to copy 1000 rows, and he/she can 
>>> tolerate
>>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad 
>>> row in the
>>> first 100 rows will fail the entire command, this might surprise the 
>>> user.
>> 
>> Since the ratio is calculated after all data is processed, the case 
>> "one bad row in the first 100 rows will fail the entire command" 
>> doesn't happen:
> 
> Yes, but is this the desired behavior when using the ratio threshold?
> I was thinking that COPY should fail once the error ratio (errors vs.
> total rows in the input file) reaches the threshold.

Users might expect like you.
However, implementing it would need a row number counting feature as you 
pointed out, and it seems an overkill.

Describing the difference between ratio and number in the manual might 
help, but
it might be better to make REJECT_LIMIT support only the number of 
errors and leave it to the user to calculate the number from the ratio.

I'd like to hear if anyone has an opinion on the need for supporting 
ratio.
I remember David prefers ratio[1].

> +      This option must be used with <literal>ON_ERROR</literal> to be 
> set to
> +      other than <literal>stop</literal>.
> 
> Regarding the ON_ERROR option, now it only has two values.
> Instead of saying "other than stop," we should simply use "ignore"
> for clarity and intuition?

I'll Modify it.
The reason for the roundabout wording was the expectation that on_error 
would support values other than these in the future, but as you point 
out, there are currently only two.


> +      When specified <literal>INFINITY</literal>, 
> <command>COPY</command> ignores all
> +      the errors. This is a synonym for <literal>ON_ERROR</literal> 
> <literal>ignore</literal>.
> 
> For the INFINITY value, the description is a bit unclear.
> As I understand it, INFINITY is the default for REJECT_LIMIT.
> So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
> COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is 
> used.
> 
> 
> In line with my previous suggestion, if we support only REJECT_LIMIT
> without ON_ERROR, having INFINITY makes sense. However,
> with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
> Users can just set ON_ERROR=ignore to ignore all errors,
> making INFINITY unnecessary for REJECT_LIMIT. This is open for
> discussion, but I believe it's better to remove INFINITY from
> the REJECT_LIMIT options.

Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in 
the next patch.

> +               else if (strcmp(defel->defname, "reject_limit") == 0)
> +               {
> +                       if (reject_limit_specified)
> +                               errorConflictingDefElem(defel, pstate);
> +                       if (!opts_out->on_error)
> +                               ereport(ERROR,
> +                                               
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                                errmsg("REJECT_LIMIT 
> requires ON_ERROR to be set to other than stop")));
> 
> Using "ignore" instead of "other than stop" in the error message
> is clearer and more intuitive.

Agreed.


> Checking if ON_ERROR and REJECT_LIMIT are specified should be
> done after the foreach() processing of the options. Otherwise,
> if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can 
> occur.
> 
> ---------------
> =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
> ERROR:  REJECT_LIMIT requires ON_ERROR to be set to other than stop
> ---------------

Ugh, I'll modify it.

> +                               ereport(ERROR,
> +                                               
> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                                                errmsg("exceeded the 
> number specified by REJECT_LIMIT \"%lld\"",
> +                                                               (long 
> long) cstate->opts.reject_limits.num_err)));
> 
> The error message isn't clear about what exceeded REJECT_LIMIT.
> How about saying "skipped more than REJECT_LIMIT rows" or something 
> instead?

Agreed.

> +/*
> + * A struct to hold reject_limit options, in a parsed form.
> + * More values to be added in another patch.
> + */
> 
> The latter comment doesn't seem necessary or helpful.

Agreed.

[1] 
https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Christensen
Дата:
Сообщение: Re: [PATCH] GROUP BY ALL
Следующее
От: torikoshia
Дата:
Сообщение: Re: Add new COPY option REJECT_LIMIT