Re: Copy Bulk Ignore Duplicated
From | Adrian Klaver
---|---
Subject | Re: Copy Bulk Ignore Duplicated
Date |
Msg-id | c051176c-3727-5fc3-fcf2-6550abe0714f@aklaver.com
In reply to | Re: Copy Bulk Ignore Duplicated (Leandro Guimarães <leo.guimaraes@gmail.com>)
List | pgsql-general
On 6/17/19 8:14 AM, Leandro Guimarães wrote:
> Hi Adrian,
>
> You are right, these fields are in CHECK CONSTRAINTS and they are
> not formally defined as Primary Keys.

Alright. Two things:

1) If you are thinking of them as keys, why not make them a PK or a
UNIQUE index?

2) It is still not clear to me whether you are looking for duplicated
information within a row or between rows. To put it another way, what
are the CHECK constraints doing?

> Thanks!
> Leandro Guimarães
>
> On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
>
> > On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> > > Hi Tim, thanks for your answer!
> > >
> > > The columns were just examples, but let me explain the database
> > > structure; the fields in *bold are the keys*:
> > >
> > > *customer_id integer*
> > > *date_time timestamp*
> > > *indicator_id integer*
> > > *element_id integer*
> > > indicator_value double precision
> >
> > Huh, earlier you said you had a check constraint that was causing
> > issues. Does that also exist, or were you referring to the keys
> > above?
> >
> > Are the keys above formally defined as the PRIMARY KEY?
> >
> > > The table is partitioned per day and customer_id (it works
> > > great); the problem is just the duplicated key situation that
> > > I'm really worried about.
> > >
> > > I populate the database via a Java application with JDBC.
> > >
> > > Maybe this info could help to provide some light!
> > >
> > > Thanks again!
> > >
> > > Leandro Guimarães
> > >
> > > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross
> > > <theophilusx@gmail.com> wrote:
> > >
> > > > Leandro Guimarães <leo.guimaraes@gmail.com> writes:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have a scenario with a large table and I'm trying to
> > > > > insert into it via a COPY command with a csv file.
> > > > >
> > > > > Everything works, but sometimes my source .csv file has
> > > > > data duplicated in the previously filled table. If I add a
> > > > > check constraint and try to run the COPY command, I get an
> > > > > error that stops the whole insertion.
> > > > >
> > > > > I've tried to put the data in a tmp table and fill the main
> > > > > one using DISTINCT this way (the fields and names are just
> > > > > examples):
> > > > >
> > > > > INSERT INTO final_table values (name, document)
> > > > > SELECT DISTINCT name, document
> > > > > FROM tmp_TABLE t1
> > > > > WHERE NOT EXISTS (
> > > > >   SELECT 1 FROM final_table t2
> > > > >   WHERE (t2.name, t2.document)
> > > > >   IS NOT DISTINCT FROM (t1.name, t1.document))
> > > > >
> > > > > The problem is that my final_table is a large (and
> > > > > partitioned) table and this query is taking a long time to
> > > > > execute.
> > > > >
> > > > > Does someone have any idea (really, guys, anything would be
> > > > > great) how to solve this situation? I need to ignore
> > > > > duplicates instead of getting an error.
> > > > >
> > > > > I'm using *PostgreSQL 9.4*, so I can't use "ON CONFLICT",
> > > > > and upgrading is not an option.
> > > >
> > > > An explain plan would probably shed some light, but I suspect
> > > > your performance is being heavily hit by the subquery.
> > > > DISTINCT is an expensive operation and you are performing it
> > > > once for every distinct row in your temp table.
> > > >
> > > > It isn't clear what the primary key is for your final table;
> > > > name + document seems suspicious, given these seem to be the
> > > > only two columns you're inserting as well. You don't indicate
> > > > what the data types are either; if document is something like
> > > > 'text', then using it in a DISTINCT clause is likely to have a
> > > > huge performance impact.
> > > >
> > > > The first thing I'd do is eliminate duplicates from your temp
> > > > table as a separate statement, or by pre-filtering the CSV
> > > > before import. I would then try something like an outer join
> > > > to identify rows in your temp table which don't exist in your
> > > > final table, and select from there to insert into the final
> > > > table. You don't really need the DISTINCT in the subquery, as
> > > > all you really need to know is whether (name, document)
> > > > exists; it doesn't matter if more than one exists (for this
> > > > test).
> > > >
> > > > If you really don't have something more specific for a primary
> > > > key, then depending on what data type 'document' is and how
> > > > large it is, you may find that adding a column which is a
> > > > checksum of your 'document' field is a useful addition. I have
> > > > done this in the past in an application where name was not
> > > > unique and we only wanted distinct instances of 'document'
> > > > (document was a fairly large XML document in that case).
> > > >
> > > > --
> > > > Tim Cross
> >
> > --
> > Adrian Klaver
> > adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com
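[Editor's note: a sketch of Adrian's first suggestion (declare the logical key). On 9.4, partitioning is inheritance-based, so the constraint must be added to each child table; the partition name below is purely illustrative, while the key columns are the ones Leandro listed:

```sql
-- Declare the logical key as a UNIQUE constraint so the database,
-- not the application, enforces uniqueness. On PostgreSQL 9.4,
-- partitions are plain child tables, so this has to be repeated on
-- every partition (the partition name here is hypothetical).
ALTER TABLE final_table_20190617_cust42
  ADD CONSTRAINT final_table_20190617_cust42_key
  UNIQUE (customer_id, date_time, indicator_id, element_id);
```

With the constraint in place, a duplicate still aborts a plain COPY, so it is a complement to the staging-table approaches below, not a replacement.]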
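[Editor's note: a sketch of the outer-join approach Tim describes, staging the CSV in a temp table and anti-joining against the target. Table and column names follow the examples in the thread (final_table, name, document); the CSV path is a placeholder:

```sql
-- Stage the CSV, then insert only rows that do not already exist in
-- final_table. IS NOT DISTINCT FROM treats two NULLs as equal, which
-- plain = would not.
CREATE TEMP TABLE tmp_table (LIKE final_table);

COPY tmp_table (name, document)
  FROM '/path/to/data.csv' WITH (FORMAT csv);

INSERT INTO final_table (name, document)
SELECT t1.name, t1.document
FROM tmp_table t1
LEFT JOIN final_table t2
  ON (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document)
WHERE t2.name IS NULL;   -- keep only rows with no match
```

The `WHERE t2.name IS NULL` filter assumes final_table.name is NOT NULL; if it can be NULL, a `NOT EXISTS` form of the same anti-join avoids the ambiguity. Note that `IS NOT DISTINCT FROM` in a join condition generally prevents index use on 9.4, so with NOT NULL key columns a plain `=` comparison will usually plan better.]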
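[Editor's note: a sketch of Tim's other point, deduplicating the staging table first so the insert itself no longer needs DISTINCT. It assumes a temp table tmp_table already loaded via COPY, with the example columns from the thread:

```sql
-- Remove duplicate rows within the staging table itself, keeping one
-- arbitrary copy of each (name, document) pair. ctid identifies a
-- physical row, which is safe to use inside a single statement.
DELETE FROM tmp_table a
USING tmp_table b
WHERE a.ctid < b.ctid
  AND (a.name, a.document) IS NOT DISTINCT FROM (b.name, b.document);

-- The anti-join insert then needs no DISTINCT: NOT EXISTS only asks
-- whether a match exists, not how many.
INSERT INTO final_table (name, document)
SELECT t1.name, t1.document
FROM tmp_table t1
WHERE NOT EXISTS (
  SELECT 1 FROM final_table t2
  WHERE (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document)
);
```

Pre-filtering the CSV outside the database (e.g. `sort -u`) achieves the same first step when the file fits the tooling.]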
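[Editor's note: a sketch of Tim's checksum idea for a large 'document' column: store and index an md5 of the value, then compare checksums instead of the full text. Column names other than the thread's examples are assumptions:

```sql
-- Add an indexed md5 checksum of the large column so duplicate checks
-- compare short fixed-length strings instead of full documents.
-- (md5() is built in; a rare hash collision would wrongly skip a row,
-- acceptable here since duplicates are being discarded anyway.)
ALTER TABLE final_table ADD COLUMN document_md5 text;
UPDATE final_table SET document_md5 = md5(document);
CREATE INDEX final_table_name_md5_idx
  ON final_table (name, document_md5);

INSERT INTO final_table (name, document, document_md5)
SELECT t1.name, t1.document, md5(t1.document)
FROM tmp_table t1
WHERE NOT EXISTS (
  SELECT 1 FROM final_table t2
  WHERE t2.name = t1.name
    AND t2.document_md5 = md5(t1.document)
);
```

On a partitioned 9.4 table the index, like any constraint, must be created on each child table.]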