Discussion: COPY from .csv File and Remove Duplicates

COPY from .csv File and Remove Duplicates

From
Rich Shepard
Date:
   A table has a sequence to generate a primary key for inserted records with
NULLs in that column.

   I have a .csv file of approximately 10k rows to copy into this table. My
two questions which have not been answered by reference to my postgres
reference book or Google searches are:

   1) Will the sequence automatically add the nextval() to each new record as
the copy command runs?

   2) Many of these rows almost certainly are already in the table. I would
like to remove duplicates either during the COPY command or immediately
after. I'm considering copying the new data into a clone of the table then
running a SELECT to add only those rows in the new cloned table to the
existing table.

   Suggestions on getting these data in without duplicating existing rows
will be very helpful.

TIA,

Rich

Re: COPY from .csv File and Remove Duplicates

From
David Johnston
Date:
On Aug 11, 2011, at 19:13, Rich Shepard <rshepard@appl-ecosys.com> wrote:

>  A table has a sequence to generate a primary key for inserted records with
> NULLs in that column.
>
>  I have a .csv file of approximately 10k rows to copy into this table. My
> two questions which have not been answered by reference to my postgres
> reference book or Google searches are:
>
>  1) Will the sequence automatically add the nextval() to each new record as
> the copy command runs?
>
>  2) Many of these rows almost certainly are already in the table. I would
> like to remove duplicates either during the COPY command or immediately
> after. I'm considering copying the new data into a clone of the table then
> running a SELECT to add only those rows in the new cloned table to the
> existing table.
>
>  Suggestions on getting these data in without duplicating existing rows
> will be very helpful.
>

If you have duplicates with matching real keys, inserting into a staging
table and then moving new records to the final table is your best option (in
general it is better to do a two-step with a staging table since you can
readily use PostgreSQL to perform any intermediate translations). As for the
import itself, I believe that if the column with the sequence is present in
the csv the sequence will not be used and, if no value is present, a null
will be stored for that column, causing any not-null constraint to throw an
error. In this case I would just import the data to a staging table without
any kind of artificial key, just the true key, and then during the merge
with the live table you simply omit the pk field from the insert statement
and the sequence will kick in at that point.
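
A minimal sketch of that flow, with placeholder table and column names (not
Rich's actual schema) and assuming the CSV carries only the data columns:

-- staging table matching the CSV layout; no artificial key column
CREATE TEMP TABLE staging (
    loc_name    text,
    sample_date date,
    param       text,
    quant       numeric
);

-- psql client-side COPY of the CSV into the staging table
\copy staging FROM 'new_data.csv' CSV HEADER

-- merge into the live table, omitting the serial pk column so its
-- DEFAULT nextval(...) supplies a value for every inserted row
INSERT INTO maintable (loc_name, sample_date, param, quant)
SELECT loc_name, sample_date, param, quant
FROM staging;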

David J.


Re: COPY from .csv File and Remove Duplicates

From
Rich Shepard
Date:
On Thu, 11 Aug 2011, David Johnston wrote:

> If you have duplicates with matching real keys inserting into a staging
> table and then moving new records to the final table is your best option
> (in general it is better to do a two-step with a staging table since you
> can readily use Postgresql to perform any intermediate translations) As
> for the import itself,

David,

   I presume what you call a staging table is what I refer to as a copy of
the main table, but with no key attribute.

   Writing the SELECT statement to delete from the staging table those rows
that already exist in the main table is where I'm open to suggestions.

> In this case I would just import the data to a staging table without any
> kind of artificial key, just the true key,

   There is no true key, only an artificial key so I can ensure that rows are
unique. That's in the main table with the 50K rows. No key column in the
.csv file.

Thanks,

Rich



Re: COPY from .csv File and Remove Duplicates

From
Craig Ringer
Date:
On 12/08/2011 7:13 AM, Rich Shepard wrote:

> I have a .csv file of approximately 10k rows to copy into this table. My
> two questions which have not been answered by reference to my postgres
> reference book or Google searches are:
>
> 1) Will the sequence automatically add the nextval() to each new record as
> the copy command runs?

No, AFAIK COPY input needs to match the table structure and can't have
default fields etc.

> 2) Many of these rows almost certainly are already in the table. I would
> like to remove duplicates either during the COPY command or immediately
> after. I'm considering copying the new data into a clone of the table then
> running a SELECT to add only those rows in the new cloned table to the
> existing table.

Rather than removing them after, I'd recommend COPYing into a temporary
staging table, then doing an

INSERT INTO realtable (cols)
SELECT stagingtable.cols FROM stagingtable LEFT OUTER JOIN realtable
ON (conditions) WHERE realtable.primarykey IS NULL;

... where "conditions" are whatever rules you use to decide that a row
in the real table is the same as a row in the staging table.

In other words: Only insert a row into the final destination table if it
does not already exist in the final destination table.
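
A concrete version of the same pattern, with assumed placeholder columns (an
id serial primary key in realtable, and loc_name, sample_date and param as
the matching data columns):

INSERT INTO realtable (loc_name, sample_date, param, quant)
SELECT s.loc_name, s.sample_date, s.param, s.quant
FROM stagingtable s
LEFT OUTER JOIN realtable r
  ON  r.loc_name    = s.loc_name
  AND r.sample_date = s.sample_date
  AND r.param       = s.param
WHERE r.id IS NULL;   -- no match found, so the staging row is new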

--
Craig Ringer

Re: COPY from .csv File and Remove Duplicates

From
"David Johnston"
Date:
>>  There is no true key, only an artificial key so I can ensure that rows
>> are unique. That's in the main table with the 50K rows. No key column in
>> the .csv file.

If you have no true key then you have no way to ensure uniqueness.  By
adding an artificial key two records that are otherwise duplicates would now
be considered unique.

Since you have not given data/structure for either the CSV or the main
table, more specific help is not possible, but when using a serial PK, in
almost every case the table should also have a candidate key with a UNIQUE
index defined.  If you cannot explain why yours does not, and why it cannot,
I would offer that you need to gain further understanding of your data
model.  It is generally wise to create a UNIQUE index on a candidate key and
risk being wrong.  At least you will be given an actual error and, in the
worst case, can always drop the UNIQUE index if the "duplicate" record
should indeed be valid; though in that situation you now have more data to
feed into your model analysis and should be able to correctly modify the
table to create a new candidate key.

Slight tangent but I have an external accounting source where I know that,
with respect to the available data, duplicates can occur (a PK field is not
available).  Since I have no candidate key I am forced to use an artificial
(serial) key and take extra precautions to ensure I do not inadvertently
introduce unintentional duplicate data during import.  In my case I handle
data at the "day" level.  My source gives me every transaction for a given
date and I then modify my live table to add only the correct number of
records so that, after the merge process, I have an exact duplicate of the
data in the source file.  Thus, since I trust the source file (and cannot
enter data via any other method), I know immediately after processing that
any duplicates on a given date are expected duplicates as opposed to, say,
me accidentally importing the same file twice and thus having twice as many
records.  I also understand that if, say for reconciliation purposes, I need
to choose one of a set of duplicate records, it does not matter initially
which one I choose; afterwards, as long as I only add records, I can ensure
that I always pick the same record in the future.  However, if I am forced
to "DELETE" a record, from a practical perspective I DELETE BOTH/ALL of the
duplicates and then ADD back the correct number of records for that date.
Anything that cared about the original records then has to decide how to
handle the fact that its record may no longer be present (instead of
deleting only some of the existing records at random without knowing which
ones are the "correct" ones to delete).

This is one example I've come across where the data I am working with has
absolutely NO inherent PK that I can see but where I can trust that, for a
given dataset, I only have valid data.  I did have to assign a SERIAL PK to
my copy of the data but I also recognized where problems could occur and
mitigated them via specific processing routines.  One alternative solution
would be to simply DELETE everything for a given date and then import every
record from the source file into the main table.  I rejected that solution
because I could not afford to continually delete the existing records as
other tables claimed FK relationships to them and continually breaking (ON
DELETE SET NULL) them was unacceptable.  I still have to do so when I need
to delete a record (rare given this is accounting data) but simply adding a
new record does not affect existing records.

Whether this situation mirrors yours I do not know but I hope this brief
description is at least informative and educational for you and others.
Feedback/Thoughts are greatly welcomed.

>>  I presume what you call a staging table is what I refer to as a copy of
>> the main table, but with no key attribute.

>> Writing the SELECT statement to delete from the staging table those rows
>> that already exist in the main table is where I'm open to suggestions.

The big question to ask is how you would be able to identify a record in the
CSV file as already being in the main table (either directly or, as in my
example above, indirectly).

My use of "staging table" reflects the fact that the structure of the table
should roughly match the CSV file and NOT the "main table".  The SQL you
issue to move records from the staging table to the main table will then
account for any differences between the two.

The general idea is to load up the staging table, optionally update
"matching" records on the main table, insert non-matching records, then
truncate/clear the staging table.

The general structure for the insert would be:

INSERT INTO maintable (cols)
SELECT cols FROM staging WHERE staging.idcols NOT IN (SELECT
maintable.idcols FROM maintable);

There may be more efficient ways to write the query but the idea is the
same.
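
Sketched end to end with placeholder data columns (the UPDATE step is
optional and treats equal loc_name/sample_date/param as a "match"):

-- 1. load the CSV into the staging table (psql client-side COPY)
\copy staging FROM 'new_data.csv' CSV HEADER

-- 2. optionally refresh rows that already exist in the main table
UPDATE maintable m
SET    quant = s.quant
FROM   staging s
WHERE  m.loc_name = s.loc_name
  AND  m.sample_date = s.sample_date
  AND  m.param = s.param;

-- 3. insert only the rows that are not already present
INSERT INTO maintable (loc_name, sample_date, param, quant)
SELECT loc_name, sample_date, param, quant
FROM   staging
WHERE  (loc_name, sample_date, param) NOT IN
       (SELECT loc_name, sample_date, param FROM maintable);

-- 4. clear the staging table for the next batch
TRUNCATE staging;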

David J.



Re: COPY from .csv File and Remove Duplicates

From
Craig Ringer
Date:
On 12/08/2011 10:32 AM, David Johnston wrote:

> The general structure for the insert would be:
>
> INSERT INTO maintable (cols)
> SELECT cols FROM staging WHERE staging.idcols NOT IN (SELECT
> maintable.idcols FROM maintable);
>
> There may be more efficient ways to write the query but the idea is the
> same.

Yeah... I'd favour an EXISTS test or a join.

INSERT INTO maintable (cols)
SELECT cols FROM staging WHERE NOT EXISTS (SELECT
1 FROM maintable WHERE maintable.idcol = staging.idcol);

... as the NOT IN(...) test can have less than lovely behavior for large
key sets.

--
Craig Ringer

Re: COPY from .csv File and Remove Duplicates

From
Rich Shepard
Date:
On Thu, 11 Aug 2011, David Johnston wrote:

> If you have duplicates with matching real keys inserting into a staging
> table and then moving new records to the final table is your best option
> (in general it is better to do a two-step with a staging table since you
> can readily use Postgresql to perform any intermediate translations) As
> for the import itself,

   It was probably the couple of days spent extracting very messy data from
Excel spreadsheets and writing Python and awk scripts to transform them that
caused me to miss the obvious: the multi-column primary key that I intended
to implement in the base table.

   Trying to add a compound primary key using (loc_name, sample_date, param)
shows there are duplicates in the original data. While there are many slight
variations on the SELECT syntax for finding duplicates based on a single
column, I've not found working syntax for finding duplicate rows based on
the values in all three columns.

   A pointer to the appropriate syntax for retrieving the entire row when
count(loc_name, sample_date, param) > 1 would be much appreciated.

Rich

Re: COPY from .csv File and Remove Duplicates

От
David Johnston
Дата:
>  A pointer to the appropriate syntax for retrieving the entire row when
> count(loc_name, sample_date, param) > 1 would be much appreciated.
>
> Rich
>

Select *
From table
Natural Inner join (
SELECT loc_name, sample_date, param, Count(*) as duplicate_count
FROM table
Group by loc_name, sample_date, param
) grouped
Where duplicate_count > 1
;

You first group and count on the candidate key and then effectively self-join that result back onto the original
table. NATURAL JOIN is shorthand for cases where the two joining tables use the same names for semantically identical
fields. Much easier than saying "t1.field1 = t2.field1 AND t1.field2 = t2.field2 AND etc..."
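
Spelled out without NATURAL JOIN (a sketch; "chemistry" stands in for the
real table name), the equivalent explicit join is:

SELECT c.*
FROM chemistry c
JOIN (
    SELECT loc_name, sample_date, param, Count(*) AS duplicate_count
    FROM chemistry
    GROUP BY loc_name, sample_date, param
) grouped
  ON  c.loc_name    = grouped.loc_name
  AND c.sample_date = grouped.sample_date
  AND c.param       = grouped.param
WHERE grouped.duplicate_count > 1;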

David J.

Re: COPY from .csv File and Remove Duplicates

From
Rich Shepard
Date:
On Fri, 12 Aug 2011, David Johnston wrote:

> Select *
> From table
> Natural Inner join (
> SELECT loc_name, sample_date, param, Count(*) as duplicate_count
> FROM table
> Group by loc_name, sample_date, param
> ) grouped
> Where duplicate_count > 1
> ;

David,

   Thank you. I was close in my attempts, but not sufficiently close. This
works.

Much appreciated,

Rich


Re: COPY from .csv File and Remove Duplicates

От
Rich Shepard
Дата:
On Fri, 12 Aug 2011, David Johnston wrote:

> Select *
> From table
> Natural Inner join (
> SELECT loc_name, sample_date, param, Count(*) as duplicate_count
> FROM table
> Group by loc_name, sample_date, param
> ) grouped
> Where duplicate_count > 1;

   Tried to use the above in an INSERT INTO statement to a copy of the main
data table. Here's what I've tried and how postgres responds:

INSERT INTO chem_too
    (lab_nbr, loc_name, sample_date, param, quant, units, qa_qc,
    easting, northing, remark)
SELECT *
FROM chemistry
Natural Inner join (
    SELECT loc_name, sample_date, param, Count(*) as duplicate_count
    FROM chemistry
    GROUP BY loc_name, sample_date, param) grouped
    WHERE duplicate_count > 1;

psql -f cp_dups.sql nevada
psql:cp_dups.sql:10: ERROR:  INSERT has more expressions than target columns
LINE 4: SELECT *
                ^
   The INSERT INTO clause lists all columns in both tables so I fail to
understand to what the error message refers.

Rich

Re: COPY from .csv File and Remove Duplicates

From
"David Johnston"
Date:
INSERT INTO chem_too
    (lab_nbr, loc_name, sample_date, param, quant, units, qa_qc,
    easting, northing, remark)
SELECT *
FROM chemistry
Natural Inner join (
    SELECT loc_name, sample_date, param, Count(*) as duplicate_count
    FROM chemistry
    GROUP BY loc_name, sample_date, param) grouped
    WHERE duplicate_count > 1;

psql -f cp_dups.sql nevada
psql:cp_dups.sql:10: ERROR:  INSERT has more expressions than target columns
LINE 4: SELECT *
                ^
   The INSERT INTO clause lists all columns in both tables so I fail to
understand to what the error message refers.

Rich

----------------------------------------------------------------------
The INSERT and SELECT portions of the query are independent; the column
listing in the INSERT does not affect the SELECT.  The only thing that
matters is that the DATA TYPEs of the matching pairs are the same.  For
instance:

INSERT INTO table1(field1, field2)
SELECT 'One' AS fieldA, 'Two'

Would work assuming that both field1 and field2 are text; fieldA gets
inserted into field1 and the unnamed second column gets inserted into
field2.

SELECT * --> means use every column from the FROM/JOIN tables.  Try issuing
the SELECT by itself and see what columns (and in what order) it returns.
In this case it will, at minimum, return a "duplicate_count" column which is
not going to be in the "chem_too" table.  Thus, you need to replace the "*"
in the SELECT with the specific columns that correspond to the columns
listed in the INSERT portion of the query.  Likely this means

INSERT INTO chem_too (lab_nbr, loc_name, sample_date, param, quant, units,
qa_qc, easting, northing, remark)
SELECT lab_nbr, loc_name, sample_date, param, quant, units, qa_qc, easting,
northing, remark
-- in the original "*" expansion the duplicate_count field is present,
-- and quite possibly the order of the fields is messed up
FROM chemistry
NATURAL JOIN ( ... ) WHERE duplicate_count > 1;

IF the chemistry table is a true copy of the chem_too table you can shorten
the above in two ways; though using NATURAL JOIN may cause them to fail due
to column order.  The above is the safest way to write the query -
explicitly specify all fields in both the INSERT and the SELECT portions of
the query.

INSERT INTO chem_too (lab_nbr, loc_name, sample_date, param, quant, units,
qa_qc, easting, northing, remark)
SELECT chemistry.*
FROM chemistry NATURAL JOIN ( ... ) WHERE duplicate_count > 1;

OR, even further,

INSERT INTO chem_too
SELECT chemistry.*
FROM chemistry NATURAL JOIN ( ... ) WHERE duplicate_count > 1;

David J.



Re: COPY from .csv File and Remove Duplicates [RESOLVED]

From
Rich Shepard
Date:
On Fri, 12 Aug 2011, David Johnston wrote:

> Thus, you need to replace the "*" in the SELECT with the specific columns
> that correspond to the columns listed in the INSERT portion of the query.

David,

   Mea culpa! I should have seen this myself. Now the query works and I have
about 6K duplicate pairs to fix.
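
One common way to drop all but one row of each duplicate group (a sketch
using the same three columns; worth testing on a copy of the table first):

DELETE FROM chemistry a
USING chemistry b
WHERE a.ctid > b.ctid
  AND a.loc_name    = b.loc_name
  AND a.sample_date = b.sample_date
  AND a.param       = b.param;

Each group keeps the row with the lowest ctid; every other copy is removed.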

Thanks,

Rich