Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Subject: Re: Copy Bulk Ignore Duplicated
Date:
Msg-id: c051176c-3727-5fc3-fcf2-6550abe0714f@aklaver.com
In response to: Re: Copy Bulk Ignore Duplicated  (Leandro Guimarães <leo.guimaraes@gmail.com>)
List: pgsql-general
On 6/17/19 8:14 AM, Leandro Guimarães wrote:
> Hi Adrian,
> 
>     You are right, these fields are in CHECK CONSTRAINTS and they are 
> not formally defined as Primary Keys.

Alright. Two things:

1) If you are thinking of them as keys, why not make them a PK or a 
UNIQUE index?

2) It's still not clear to me whether you are looking for duplicated 
information within a row or between rows.
To put it another way, what are the CHECK constraints doing?
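
For what it's worth, 1) could look roughly like the following, assuming the 
four columns you listed earlier are what identify a row (the table and 
constraint names are just placeholders):

ALTER TABLE final_table
    ADD CONSTRAINT final_table_natural_key
    UNIQUE (customer_id, date_time, indicator_id, element_id);

On 9.4's inheritance-based partitioning a unique constraint is only enforced 
within each individual table, so it would have to be added to every partition; 
since you partition by day and customer_id, rows sharing this key should still 
land in the same partition and be caught.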


> 
> Thanks!
> Leandro Guimarães
> 
> 
> 
> On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver 
> <adrian.klaver@aklaver.com> wrote:
> 
>     On 6/14/19 7:24 PM, Leandro Guimarães wrote:
>      > Hi Tim, thanks for your answer!
>      >
>      > The columns were just examples, but let me explain the database
>      > structure, the fields in *bold are the keys*:
>      >
>      > *customer_id integer*
>      > *date_time timestamp*
>      > *indicator_id integer*
>      > *element_id integer*
>      > indicator_value double precision
> 
>     Huh, earlier you said you had a check constraint that was causing
>     issues.
> 
>     Does that also exist or were you referring to the keys above?
> 
>     Are the keys above formally defined as the PRIMARY KEY?
>     -
>      >
>      > The table is partitioned per day and customer_id (it works great); the
>      > problem is just the duplicated key situation that I'm really worried
>      > about.
>      >
>      > I populate the database via a Java Application with JDBC.
>      >
>      > Maybe this info could help to provide some light!
>      >
>      > Thanks Again!
>      >
>      > Leandro Guimarães
>      >
>      >
>      >
>      > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
>      >
>      >
>      >     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
>      >
>      >      > Hi,
>      >      >
>      >      >    I have a scenario with a large table and I'm trying to insert
>      >      > into it via a COPY command with a csv file.
>      >      >
>      >      >    Everything works, but sometimes my source .csv file has duplicated
>      >      > data in the previously populated table. If I add a check constraint
>      >      > and try to run the COPY command I get an error that stops the whole
>      >      > insertion.
>      >      >
>      >      >   I've tried to put the data in a tmp table and fill the main table
>      >      > using distinct this way (the fields and names are just examples):
>      >      >
>      >      > INSERT INTO final_table (name, document)
>      >      >    SELECT DISTINCT name, document
>      >      >    FROM tmp_TABLE t1
>      >      >    WHERE NOT EXISTS (
>      >      >    SELECT 1 FROM final_table t2
>      >      >    WHERE (t2.name, t2.document)
>      >      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >      >
>      >      > The problem is that my final_table is a large (and partitioned)
>      >      > table and this query is taking a long time to execute.
>      >      >
>      >      > Does anyone have any idea (really guys anything would be great) how
>      >      > to solve this situation? I need to ignore duplicates instead of
>      >      > getting an error.
>      >      >
>      >      > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and
>      >      > upgrading is not an option.
>      >      >
>      >
>      >     Explain plan would probably shed some light, but I suspect your
>      >     performance is being heavily hit by the sub query. Distinct is an
>      >     expensive operation and you are performing it once for every
>      >     distinct row
>      >     in your temp table.
>      >
>      >     It isn't clear what the primary key is for your final table - name +
>      >     document seems suspicious given these seem to be the only two columns
>      >     you're inserting as well. You don't indicate what the data types are
>      >     either - if document is something like 'text' then using it in a
>      >     distinct clause is likely to have a huge performance impact.
>      >
>      >     The first thing I'd do is to eliminate duplicates from your temp table
>      >     as a separate statement or by pre-filtering the CSV before import. I
>      >     would then try something like an outer join to identify rows in your
>      >     temp table which don't exist in your final table and select from there
>      >     to insert into the final table. You don't really need the distinct in
>      >     the sub query as all you really need to know is if (name, document)
>      >     exists - it doesn't matter if more than one exists (for this test).
>      >
>      >     If you really don't have something more specific for a primary key,
>      >     depending on what data type 'document' is and how large it is, you may
>      >     find adding a column which is a checksum of your 'document' field a
>      >     useful addition. I have done this in the past where I had an application
>      >     where name was not unique and we only wanted distinct instances of
>      >     'document' (document was a fairly large XML document in this case).
>      >
>      >     --
>      >     Tim Cross
>      >
>      >
> 
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> 
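
For reference, the staging-table / anti-join approach Tim describes could be 
sketched roughly like this on 9.4 (untested; the table names and file path are 
placeholders, the columns are the examples given above):

-- Load the CSV into an unindexed staging table first.
COPY staging_table FROM '/path/to/file.csv' WITH (FORMAT csv);

-- Collapse duplicates inside the staging data itself, keeping one
-- (arbitrary) row per key.
CREATE TEMP TABLE staging_dedup AS
SELECT DISTINCT ON (customer_id, date_time, indicator_id, element_id) *
FROM staging_table;

-- Anti-join: insert only rows whose key is not already in the target.
INSERT INTO final_table
    (customer_id, date_time, indicator_id, element_id, indicator_value)
SELECT s.customer_id, s.date_time, s.indicator_id, s.element_id,
       s.indicator_value
FROM staging_dedup s
LEFT JOIN final_table f
       ON  f.customer_id  = s.customer_id
       AND f.date_time    = s.date_time
       AND f.indicator_id = s.indicator_id
       AND f.element_id   = s.element_id
WHERE f.customer_id IS NULL;

The same check can be written with NOT EXISTS; the main point, per the advice 
above, is to de-duplicate the small staging table once and only then probe the 
large partitioned table.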


-- 
Adrian Klaver
adrian.klaver@aklaver.com


