Re: Copy Bulk Ignore Duplicated

From Leandro Guimarães
Subject Re: Copy Bulk Ignore Duplicated
Date
Msg-id CAJV35FN428XGBeNYUf_y5YnydTG4mW2Chk1w9_fLUMd-tJaKHA@mail.gmail.com
In reply to Re: Copy Bulk Ignore Duplicated  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses Re: Copy Bulk Ignore Duplicated  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
Hi Adrian,

   You are right, these fields are in CHECK CONSTRAINTS and they are not formally defined as Primary Keys.
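
A CHECK constraint only validates values within a single row, so rejecting duplicates across rows would take a UNIQUE constraint or a unique index instead. A minimal sketch of the difference, reusing the table and column names from the example query below (purely illustrative):

    -- A CHECK constraint inspects each row in isolation:
    ALTER TABLE final_table
        ADD CONSTRAINT indicator_value_non_negative CHECK (indicator_value >= 0);

    -- Rejecting duplicate keys across rows needs a UNIQUE constraint:
    ALTER TABLE final_table
        ADD CONSTRAINT final_table_key
        UNIQUE (customer_id, date_time, indicator_id, element_id);

(On a 9.4 inheritance-partitioned table the UNIQUE constraint has to be declared on each child partition.)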

Thanks!
Leandro Guimarães



On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for your answer!
>
> The columns were just examples, but let me explain the database
> structure, the fields in *bold* are the keys:
>
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision

Huh, earlier you said you had a check constraint that was causing issues.

Does that also exist, or were you referring to the keys above?

Are the keys above formally defined as the PRIMARY KEY?
>
> The table is partitioned per day and customer_id (it works great); the
> problem is just the duplicate-key situation that I'm really worried about.
>
> I populate the database via a Java Application with JDBC.
>
> Maybe this info could help shed some light!
>
> Thanks Again!
>
> Leandro Guimarães
>
>
>
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
>
>
>     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
>
>      > Hi,
>      >
>      >    I have a scenario with a large table and I'm trying to insert
>      > into it via a COPY command with a CSV file.
>      >
>      >    Everything works, but sometimes my source .csv file has duplicated
>      > data already present in the previously populated table. If I add a check
>      > constraint and try to run the COPY command I get an error that stops the
>      > whole insertion.
>      >
>      >   I've tried to put the data in a tmp table and fill the main one using
>      > DISTINCT this way (the fields and names are just examples):
>      >
>      > INSERT INTO final_table (name, document)
>      >    SELECT DISTINCT name, document
>      >    FROM tmp_TABLE t1
>      >    WHERE NOT EXISTS (
>      >    SELECT 1 FROM final_table t2
>      >    WHERE (t2.name, t2.document)
>      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >
>      > The problem is that my final_table is a large (and partitioned) table
>      > and this query is taking a long time to execute.
>      >
>      > Does anyone have any idea (really, guys, anything would be great) how to
>      > solve this situation? I need to ignore duplicates instead of getting an
>      > error.
>      >
>      > I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT", and an upgrade
>      > is not an option.
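
For contrast, on PostgreSQL 9.5 and later this whole problem is a one-liner; a sketch using the column names given earlier in the thread (not available on 9.4, and it requires a unique index on the key columns):

    INSERT INTO final_table (customer_id, date_time, indicator_id, element_id, indicator_value)
    SELECT customer_id, date_time, indicator_id, element_id, indicator_value
    FROM tmp_table
    ON CONFLICT (customer_id, date_time, indicator_id, element_id) DO NOTHING;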
>      >
>
>     Explain plan would probably shed some light, but I suspect your
>     performance is being heavily hit by the subquery. DISTINCT is an
>     expensive operation and you are performing it once for every distinct
>     row in your temp table.
>
>     It isn't clear what the primary key is for your final table - name +
>     document seems suspicious given these seem to be the only two columns
>     you're inserting as well. You don't indicate what the data types are
>     either - if document is something like 'text' then using it in a
>     DISTINCT clause is likely to have a huge performance impact.
>
>     The first thing I'd do is to eliminate duplicates from your temp table
>     as a separate statement or by pre-filtering the CSV before import. I
>     would then try something like an outer join to identify rows in your
>     temp table which don't exist in your final table and select from there
>     to insert into the final table. You don't really need the DISTINCT in
>     the subquery, as all you really need to know is if (name, document)
>     exists - it doesn't matter if more than one exists (for this test).
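
A minimal sketch of the approach Tim describes, assuming the staging table and the key columns mentioned earlier in the thread:

    -- 1. Deduplicate the staging table in a single pass:
    CREATE TEMP TABLE tmp_dedup AS
    SELECT DISTINCT ON (customer_id, date_time, indicator_id, element_id) *
    FROM tmp_table;

    -- 2. Anti-join insert: only rows with no match in final_table survive,
    --    and no DISTINCT is needed in the probe:
    INSERT INTO final_table (customer_id, date_time, indicator_id, element_id, indicator_value)
    SELECT t1.customer_id, t1.date_time, t1.indicator_id, t1.element_id, t1.indicator_value
    FROM tmp_dedup t1
    LEFT JOIN final_table t2
           ON t2.customer_id  = t1.customer_id
          AND t2.date_time    = t1.date_time
          AND t2.indicator_id = t1.indicator_id
          AND t2.element_id   = t1.element_id
    WHERE t2.customer_id IS NULL;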
>
>     If you really don't have something more specific for a primary key,
>     depending on what data type 'document' is and how large it is, you may
>     find it useful to add a column which is a checksum of your 'document'
>     field. I have done this in the past where I had an application
>     where name was not unique and we only wanted distinct instances of
>     'document' (document was a fairly large XML document in this case).
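
A sketch of the checksum idea, with hypothetical names, assuming 'document' is a large text column:

    -- Store a fixed-size digest alongside the large column:
    ALTER TABLE final_table ADD COLUMN document_md5 text;
    UPDATE final_table SET document_md5 = md5(document);

    -- Index the digest so duplicate checks compare a 32-character
    -- value instead of the whole document:
    CREATE INDEX ON final_table (name, document_md5);

md5() is built in; if a stronger digest is wanted, the pgcrypto extension provides digest(document, 'sha256').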
>
>     --
>     Tim Cross
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com
