Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Pavel Stehule
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date:
Msg-id: CAFj8pRC96C+C6Gg8OHkv26YZMQsrRJ9YtJhBKrrrje1hc2yrcw@mail.gmail.com
In response to: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (Damir Belyalov <dam.bel07@gmail.com>)
Responses: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (Damir Belyalov <dam.bel07@gmail.com>)
List: pgsql-hackers
Hi

On Sat, Dec 18, 2021 at 9:55, Damir Belyalov <dam.bel07@gmail.com> wrote:
Hello.

I wrote a patch implementing COPY that ignores errors in rows, using block subtransactions.

It is great that you are working on this patch. Unfortunately, I am afraid this simple design is not optimal: using a subtransaction for every row has too much overhead. I think it should use one subtransaction per block of rows (e.g. 1000 rows), and only when a block raises an exception should it replay that block's inserts with a subtransaction per row. You should also check the performance overhead.
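
To illustrate what I mean, here is a rough sketch only, modeled on the pattern PL/pgSQL uses for exception blocks. RowBuffer and insert_rows() are hypothetical placeholders, and error reporting and resource-owner handling are abbreviated; the real COPY code paths differ:

#include "postgres.h"
#include "access/xact.h"        /* BeginInternalSubTransaction & friends */
#include "utils/memutils.h"
#include "utils/resowner.h"

typedef struct RowBuffer        /* placeholder for a batch of parsed rows */
{
    int         nrows;
    /* ... parsed tuples ... */
} RowBuffer;

static void insert_rows(RowBuffer *buf, int from, int to);  /* placeholder */

static void
flush_block(RowBuffer *buf)
{
    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;
    int         i;

    BeginInternalSubTransaction(NULL);
    MemoryContextSwitchTo(oldcontext);

    PG_TRY();
    {
        insert_rows(buf, 0, buf->nrows);    /* the whole block at once */
        ReleaseCurrentSubTransaction();     /* one subxact per ~1000 rows */
    }
    PG_CATCH();
    {
        /* the block failed somewhere: throw it away ... */
        FlushErrorState();
        RollbackAndReleaseCurrentSubTransaction();

        /* ... and replay it row by row, skipping only the bad rows */
        for (i = 0; i < buf->nrows; i++)
        {
            BeginInternalSubTransaction(NULL);
            MemoryContextSwitchTo(oldcontext);

            PG_TRY();
            {
                insert_rows(buf, i, i + 1);
                ReleaseCurrentSubTransaction();
            }
            PG_CATCH();
            {
                FlushErrorState();
                RollbackAndReleaseCurrentSubTransaction();
                /* log the bad row here */
            }
            PG_END_TRY();
        }
    }
    PG_END_TRY();

    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
}

When the data is almost all correct, the per-row replay almost never runs, so the subtransaction cost stays close to 1/1000 of the per-row design.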

Regards

Pavel



Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERRORS;

Examples:
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2 2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2 2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2: "2 2"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 a 2
3 3 3
\.
WARNING:  COPY check_ign_err, line 2, column m: "a"
SELECT * FROM check_ign_err;
 n | m | k
---+---+---
 1 | 1 | 1
 3 | 3 | 3
(2 rows)



Regards, Damir

On Fri, Dec 10, 2021 at 21:48, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:


2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:
Trent Shipley <trent_shipley@qwest.net> writes:

> On Friday 2007-12-14 16:22, Tom Lane wrote:
>> Neil Conway <neilc@samurai.com> writes:
>> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
>> > to drop (and log) rows that contain malformed data. That is, rows with
>> > too many or too few columns, rows that result in constraint violations,
>> > and rows containing columns where the data type's input function raises
>> > an error. The last case is the only thing that would be a bit tricky to
>> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
>> > but I guess you'd need a subtransaction to ensure that you reset your
>> > state correctly after catching an error.
>>
>> Yeah.  It's the subtransaction per row that's daunting --- not only the
>> cycles spent for that, but the ensuing limitation to 4G rows imported
>> per COPY.
>
> You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This
> would help with the 4G subtransaction limit.  The cost to the ETL process is
> that a simple rollback would not be guaranteed to send the process back to its
> initial state.  There are easy ways to deal with the rollback issue though.
>
> A {NO} RETRY {USING algorithm} clause might be useful.   If the NO RETRY
> option is selected then the COPY FROM can run without subtransactions and in
> excess of the 4G per transaction limit.  NO RETRY should be the default since
> it preserves the legacy behavior of COPY FROM.
>
> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the
> option of sending exceptions to a table, since they are presumably malformed;
> otherwise they would not be exceptions.  (Users should re-process exception
> files if they want an "if good then table A, else exceptions to table B" flow ...)
>
> EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>
>
>> If we could somehow only do a subtransaction per failure, things would
>> be much better, but I don't see how.

Hello,

Attached is a proof of concept patch for this TODO item.  There are no
docs yet; I just wanted to know if the approach is sane.

The added syntax is like the following:

  COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

The way it's done is by abusing Copy Both mode, and from my limited
testing that seems to just work.  The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors; no attempt is made to recover from a failed unique
constraint, etc.
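
Conceptually the trap looks something like this (a simplified sketch, not the patch itself; oldcontext, in_func, field_str, typioparam, typmod and raw_line come from the surrounding COPY code, and send_exception_line() stands in for the "exceptions to" output path):

Datum       value;
bool        ok = true;

PG_TRY();
{
    value = InputFunctionCall(&in_func, field_str, typioparam, typmod);
}
PG_CATCH();
{
    ErrorData  *edata;

    MemoryContextSwitchTo(oldcontext);  /* leave the failed call's context */
    edata = CopyErrorData();            /* copy the error before clearing it */
    FlushErrorState();

    ereport(NOTICE,
            (errmsg("%s", edata->message)));
    send_exception_line(raw_line);      /* echo the bad row to the client */
    FreeErrorData(edata);
    ok = false;
}
PG_END_TRY();

if (!ok)
    continue;       /* skip this line; on success, "value" feeds the tuple */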

Example in action:

postgres=# \d test_copy2
  Table "public.test_copy2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | integer |

postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 1: "1"
2
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 2: "2"
3
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 3: "3"
NOTICE:  total exceptions ignored: 3

postgres=# \d test_copy1
  Table "public.test_copy1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null

postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#

Limited performance testing shows no significant difference between the
error-catching and plain code paths.  For example, timing

  copy test_copy1 from program 'seq 1000000' [exceptions to stdout]

shows similar numbers with or without the added "exceptions to" clause.

Now that I'm sending this, I wonder if the original comment about the
need for a subtransaction around every loaded line still holds.  Any
example of what would not be properly rolled back by just PG_TRY?

This method is unsafe: exception handlers usually don't free memory, so there is a risk of memory and resource leaks.

You can get the same performance with block subtransactions - when you use one subtransaction per 1000 rows, the impact of subtransactions is minimal.

When a block fails, you can fall back to row-level subtransactions - this works well when you expect mostly correct data.

Two years ago I wrote an extension that did this, but I did not have time to finish it and push it upstream.

Regards

Pavel
 

Regards

Pavel
 

Happy hacking!
--
Alex







