Re: Copy Bulk Ignore Duplicated

From Tim Cross
Subject Re: Copy Bulk Ignore Duplicated
Date
Msg-id 87o92zpzba.fsf@gmail.com
In reply to Copy Bulk Ignore Duplicated  (Leandro Guimarães <leo.guimaraes@gmail.com>)
Responses Re: Copy Bulk Ignore Duplicated  (Leandro Guimarães <leo.guimaraes@gmail.com>)
List pgsql-general
Leandro Guimarães <leo.guimaraes@gmail.com> writes:

> Hi,
>
>    I have a scenario with a large table and I'm trying to load data into
> it via a COPY command with a CSV file.
>
>    Everything works, but sometimes my source .csv file has data that is
> already present in the previously filled table. If I add a check constraint
> and try to run the COPY command, I get an error that stops the whole insertion.
>
>   I've tried to put the data in a tmp table and fill the main table using
> DISTINCT this way (the fields and names are just examples):
>
> INSERT INTO final_table (name, document)
>    SELECT DISTINCT name, document
>    FROM tmp_table t1
>    WHERE NOT EXISTS (
>       SELECT 1 FROM final_table t2
>       WHERE (t2.name, t2.document)
>       IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table and
> this query is taking a long time to execute.
>
> Does anyone have any idea (really, guys, anything would be great) how to
> solve this situation? I need to ignore duplicates instead of getting an error.
>
> I'm using PostgreSQL 9.4, so I can't use "ON CONFLICT", and upgrading is
> not an option.
>

An EXPLAIN plan would probably shed some light, but I suspect your
performance is being heavily hit by the subquery. DISTINCT is an
expensive operation, and the correlated NOT EXISTS subquery has to be
evaluated for every row in your temp table.
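
For example, running EXPLAIN over the insert (a minimal sketch, just
reusing the table and column names from your example) would show how the
planner is handling that anti-join and whether the partitions are being
scanned repeatedly:

  EXPLAIN
  INSERT INTO final_table (name, document)
  SELECT DISTINCT name, document
  FROM tmp_table t1
  WHERE NOT EXISTS (
      SELECT 1 FROM final_table t2
      WHERE (t2.name, t2.document)
            IS NOT DISTINCT FROM (t1.name, t1.document));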

It isn't clear what the primary key is for your final table - name +
document seems suspicious given these seem to be the only two columns
you're inserting as well. You don't indicate what the data types are
either - if document is something like 'text' then using it in a
DISTINCT clause is likely to have a huge performance impact.

The first thing I'd do is eliminate duplicates from your temp table
as a separate statement, or by pre-filtering the CSV before import. I
would then try something like an outer join to identify rows in your
temp table which don't exist in your final table and select from there
to insert into the final table. You don't really need the DISTINCT in
the subquery, as all you really need to know is whether (name, document)
exists - it doesn't matter if more than one match exists (for this test).
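
A rough sketch of that approach, assuming name and document are NOT NULL
and are the only columns that matter (names here just follow your example):

  -- De-duplicate the imported data once, up front
  CREATE TEMP TABLE tmp_dedup AS
  SELECT DISTINCT name, document
  FROM tmp_table;

  -- Anti-join against the final table; no DISTINCT needed in the lookup
  INSERT INTO final_table (name, document)
  SELECT t1.name, t1.document
  FROM tmp_dedup t1
  LEFT JOIN final_table t2
    ON t2.name = t1.name
   AND t2.document = t1.document
  WHERE t2.name IS NULL;

If name or document can be NULL you would need NOT EXISTS with
IS NOT DISTINCT FROM rather than the plain equality join, but then you
are back to comparing the full document value on every probe.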

If you really don't have something more specific for a primary key,
depending on what data type 'document' is and how large it is, you may
find it useful to add a column which is a checksum of your 'document'
field. I have done this in the past in an application where name was not
unique and we only wanted distinct instances of 'document' (document was
a fairly large XML document in that case).
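
If that route looks useful, here is a sketch of the idea using md5()
(assuming document is a text-like type; column and index names are just
illustrative):

  -- Store a checksum of document alongside the data
  ALTER TABLE final_table ADD COLUMN document_md5 text;
  UPDATE final_table SET document_md5 = md5(document);
  CREATE INDEX ON final_table (document_md5);

  -- Compare on the short checksum rather than the full document
  INSERT INTO final_table (name, document, document_md5)
  SELECT name, document, md5(document)
  FROM tmp_table t1
  WHERE NOT EXISTS (
      SELECT 1 FROM final_table t2
      WHERE t2.document_md5 = md5(t1.document));

On a partitioned 9.4 table you would need that index on each child
partition, and a checksum match is only a strong hint of equality, so you
may still want to compare the full document on the (rare) matches.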

--
Tim Cross


