Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

From Robert Haas
Subject Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Date
Msg-id CA+TgmoZtiWK4zD76hXD8Pw0CuwShYEW2jtGA6G9iT3d8rfSoiw@mail.gmail.com
In response to Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Nicolas Barbier <nicolas.barbier@gmail.com>)
Responses Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
List pgsql-hackers
On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
> 2017-04-11 Robert Haas <robertmhaas@gmail.com>:
>> There's a nasty trade-off here between XID consumption (and the
>> aggressive vacuums it eventually causes) and preserving performance in
>> the face of errors - e.g. if you make k = 100,000 you consume 100x
>> fewer XIDs than if you make k = 1000, but you also have 100x the work
>> to redo (on average) every time you hit an error.
>
> You could make it dynamic: Commit the subtransaction after N lines if
> no error has been encountered (N starts out at 1), then double N and
> continue. When encountering an error, roll back the current
> subtransaction, re-insert all the known-good rows that were rolled
> back (plus maybe put the erroneous row into a separate table or
> whatever) in one new subtransaction, and commit; then reset N to 1
> and continue processing the rest of the file.
>
> That would work reasonably well whenever the fraction of erroneous
> rows is not extremely high, whether the erroneous rows are all clumped
> together, spread out entirely at random over the file, or a
> combination of both.

Right.  I wouldn't suggest the exact algorithm you proposed; I think
you ought to vary between some lower limit >1, maybe 10, and some
upper limit, maybe 1,000,000, ratcheting up and down based on how
often you hit errors in some way that might not be as simple as
doubling.  But something along those lines.
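
For anyone who wants to put rough numbers on that, here is a quick
standalone sketch (plain C, nothing to do with the actual COPY code) of
that kind of ratcheting policy.  The limits and the grow/shrink factors
are just placeholders for whatever we'd actually pick; the only point is
to see roughly how many subtransaction XIDs a given error rate costs.

/*
 * Throwaway simulation: process "rows" in subtransaction-sized batches,
 * grow the batch size after a clean commit, shrink it after an error,
 * and count how many subtransaction XIDs get burned.
 */
#include <stdio.h>
#include <stdlib.h>

#define BATCH_MIN 10            /* lower limit, per the suggestion above */
#define BATCH_MAX 1000000       /* upper limit, per the suggestion above */

int
main(void)
{
    long    total_rows = 10000000;  /* rows in the input file */
    double  error_rate = 0.001;     /* fraction of bad rows */
    long    batch = BATCH_MIN;
    long    row = 0;
    long    xids = 0;               /* subtransactions that wrote something */

    srand(42);

    while (row < total_rows)
    {
        long    n = (total_rows - row < batch) ? total_rows - row : batch;
        long    bad = -1;
        long    i;

        xids++;                     /* subxact opened for this batch */

        /* find the first bad row in the batch, the way COPY would hit it */
        for (i = 0; i < n; i++)
        {
            if ((double) rand() / RAND_MAX < error_rate)
            {
                bad = i;
                break;
            }
        }

        if (bad < 0)
        {
            /* whole batch committed: ratchet the batch size up */
            row += n;
            batch = (batch * 2 > BATCH_MAX) ? BATCH_MAX : batch * 2;
        }
        else
        {
            /*
             * Batch rolled back: redo the known-good prefix (if any) in
             * one new subtransaction, skip the bad row, and ratchet the
             * batch size down.
             */
            if (bad > 0)
                xids++;
            row += bad + 1;
            batch = (batch / 4 < BATCH_MIN) ? BATCH_MIN : batch / 4;
        }
    }

    printf("rows: %ld, expected errors: ~%.0f, subxact XIDs used: %ld\n",
           total_rows, total_rows * error_rate, xids);
    return 0;
}

With clean data it climbs to the upper limit and stays there; push
error_rate up and the XID count grows more or less with the number of
bad rows (plus a bit extra for ratcheting back up), which is about the
best we can hope for.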

>> If the data quality is poor (say, 50% of lines have errors) it's
>> almost impossible to avoid runaway XID consumption.
>
> Yup, that seems difficult to work around with anything similar to the
> proposed approach. So the docs might need to suggest not inserting a 300 GB
> file with 50% erroneous lines :-).

Yep.  But it does seem reasonably likely that someone might shoot
themselves in the foot anyway.  Maybe we just live with that.
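
(Back-of-the-envelope, assuming ~100-byte lines: a 300 GB file is on the
order of 3 billion lines.  With half of them failing, the batch size
stays pinned at the lower limit and you burn roughly a subtransaction
XID for every line or two, i.e. a couple of billion XIDs -- an order of
magnitude past the default autovacuum_freeze_max_age of 200 million, and
uncomfortably close to the 2^31 wraparound horizon.)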

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


