Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Subject: Re: Copy Bulk Ignore Duplicated
Date:
Msg-id: c426e95f-3144-410b-413c-4c6b016d7d2b@aklaver.com
In response to: Re: Copy Bulk Ignore Duplicated  (Leandro Guimarães <leo.guimaraes@gmail.com>)
Responses: Re: Copy Bulk Ignore Duplicated  (Leandro Guimarães <leo.guimaraes@gmail.com>)
List: pgsql-general
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for your answer!
> 
> The columns were just examples, but let me explain the database 
> structure, the fields in *bold are the keys*:
> 
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision

Huh, earlier you said you had a check constraint that was causing issues.

Does that also exist, or were you referring to the keys above?

Are the keys above formally defined as the PRIMARY KEY?
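
For reference, "formally defined" would mean something like the rough, untested
sketch below (the partition name is invented; on 9.4's inheritance-style
partitioning the constraint has to be created on each child table, since
primary keys are not inherited from the parent):

    -- Hypothetical sketch only: declare the four columns as the key.
    -- On 9.4 inheritance partitioning, repeat this on every child partition.
    ALTER TABLE final_table_20190614_cust42
        ADD CONSTRAINT final_table_20190614_cust42_pk
        PRIMARY KEY (customer_id, date_time, indicator_id, element_id);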
> 
> The table is partitioned per day and customer_id (it works great); the 
> problem is just the duplicate key situation that I'm really worried about.
> 
> I populate the database via a Java Application with JDBC.
> 
> Maybe this info could help to provide some light!
> 
> Thanks Again!
> 
> Leandro Guimarães
> 
> 
> 
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
> 
> 
>     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
> 
>      > Hi,
>      >
>      >    I have a scenario with a large table and I'm trying to insert
>     it via a
>      > COPY command with a csv file.
>      >
>      >    Everything works, but sometimes my source .csv file has
>     duplicated data
>      > in the previously fulfilled table. If I add a check constraint
>     and try to
>      > run the COPY command I have an error that stops the whole insertion.
>      >
>      >   I've tried to put the data in a tmp table and fill the main table
>      > using distinct this way (the fields and names are just examples):
>      >
>      > INSERT INTO final_table (name, document)
>      >    SELECT DISTINCT name, document
>      >    FROM tmp_TABLE t1
>      >    WHERE NOT EXISTS (
>      >    SELECT 1 FROM final_table t2
>      >    WHERE (t2.name, t2.document)
>      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >
>      > The problem is that my final_table is a large (and partitioned)
>     table and
>      > this query is taking a long time to execute.
>      >
>      > Does anyone have any idea (really guys, anything would be great) how
>     to solve
>      > this situation? I need to ignore duplicates instead of getting an
>     error.
>      >
>      > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and
>     upgrade is not
>      > an option.
>      >
> 
>     Explain plan would probably shed some light, but I suspect your
>     performance is being heavily hit by the sub query. Distinct is an
>     expensive operation and you are performing it once for every
>     distinct row
>     in your temp table.
> 
>     It isn't clear what the primary key is for your final table - name +
>     document seems suspicious given these seem to be the only two columns
>     you're inserting as well. You don't indicate what the data types are
>     either - if document is something like 'text' then using it in a
>     distinct clause is likely to have a huge performance impact.
> 
>     The first thing I'd do is to eliminate duplicates from your temp table
>     as a separate statement or by pre-filtering the CSV before import. I
>     would then try something like an outer join to identify rows in your
>     temp table which don't exist in your final table and select from there
>     to insert into the final table. You don't really need the distinct in
>     the sub query as all you really need to know is if (name, document)
>     exists - it doesn't matter if more than one exists (for this test).
> 
>     If you really don't have something more specific for a primary key,
>     depending on what data type 'document' is and how large it is, you may
>     find adding a column which is a checksum of your 'document' field a
>     useful addition. I have done this in the past where I had an application
>     where name was not unique and we only wanted distinct instances of
>     'document' (document was a fairly large XML document in this case).
> 
>     --
>     Tim Cross
> 
> 

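For what it's worth, the staging approach Tim describes might look roughly
like the untested sketch below on 9.4 (table, column and file names are just
the examples from this thread or invented, and it assumes the four key
columns are NOT NULL):

    -- 1. Stage the CSV in a temporary table.
    CREATE TEMP TABLE tmp_values (LIKE final_table INCLUDING DEFAULTS);
    COPY tmp_values FROM '/path/to/data.csv' WITH (FORMAT csv);

    -- 2. De-duplicate the staged rows themselves, keeping one row per key.
    DELETE FROM tmp_values a
        USING tmp_values b
        WHERE a.ctid < b.ctid
          AND a.customer_id  = b.customer_id
          AND a.date_time    = b.date_time
          AND a.indicator_id = b.indicator_id
          AND a.element_id   = b.element_id;

    -- 3. Anti-join against the existing data; no DISTINCT needed, and
    --    rows already present in final_table are simply skipped.
    INSERT INTO final_table
        (customer_id, date_time, indicator_id, element_id, indicator_value)
    SELECT t.customer_id, t.date_time, t.indicator_id, t.element_id,
           t.indicator_value
    FROM tmp_values t
    LEFT JOIN final_table f
           ON f.customer_id  = t.customer_id
          AND f.date_time    = t.date_time
          AND f.indicator_id = t.indicator_id
          AND f.element_id   = t.element_id
    WHERE f.customer_id IS NULL;
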

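And the checksum idea, very roughly, with the (name, document) example
columns - md5() is just a convenient built-in here, and the added column and
index names are invented:

    -- Hypothetical sketch: store a short checksum of the large 'document'
    -- column and compare that instead of the full text.
    ALTER TABLE final_table ADD COLUMN document_md5 text;
    UPDATE final_table SET document_md5 = md5(document);  -- cast to text first if document is xml
    CREATE INDEX final_table_name_md5_idx ON final_table (name, document_md5);

    -- The existence test then uses the indexed checksum:
    SELECT 1
    FROM final_table f
    WHERE f.name = 'some name'
      AND f.document_md5 = md5('...full document text...');
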
-- 
Adrian Klaver
adrian.klaver@aklaver.com


