Discussion: Recommended Protocol: Adding Rows to Table
The data originated in a spreadsheet and, based on my experience, contains duplicate records. After reformatting there are 143,260 rows to insert in the table. The approach I tried seems to have problems (explained below) and I would like to learn the proper way to insert rows in either an empty table or one with existing rows since I'll need to do this procedure for my projects.

The table was created with the primary key and I used INSERT INTO ... to load the data. Many duplicate records, so I split the file into smaller ones and re-ran the command to load them. I'd then remove the reported duplicate rows from the source (text) file. Between runs, I'd issue the DROP TABLE command within psql and check it was empty using 'select count(*) from waterchem;'.

It appeared that I removed duplicates from the first couple of smaller files so I combined them into one file named ok.sql. But, when I tested the combined file it, too, reported many duplicate records. Something wrong here.

Perhaps a better approach is to put the CREATE TABLE command above the INSERT INTO section of the file (without specifying a PK), load that using the command 'psql -d <database> -f waterchem.sql', then add the PK and remove duplicates as postgres presents them. Or, perhaps there is a much more efficient way to accomplish this task.

Rather than my flailing around and spending a lot of time failing to load all non-duplicate rows into the table I'd like to learn the proper way to accomplish this task. Suggestions, recommendations, and your experiences are requested.

TIA,

Rich
On Nov 25, 2011, at 11:05, Rich Shepard <rshepard@appl-ecosys.com> wrote:

> The data originated in a spreadsheet and, based on my experience, contains
> duplicate records. After reformatting there are 143,260 rows to insert in
> the table. The approach I tried seems to have problems (explained below)
> and I would like to learn the proper way to insert rows in either an empty
> table or one with existing rows since I'll need to do this procedure for
> my projects.
>
> The table was created with the primary key and I used INSERT INTO ... to
> load the data. Many duplicate records, so I split the file into smaller
> ones and re-ran the command to load them. I'd then remove the reported
> duplicate rows from the source (text) file. Between runs, I'd issue the
> DROP TABLE command within psql and check it was empty using
> 'select count(*) from waterchem;'.
>
> It appeared that I removed duplicates from the first couple of smaller
> files so I combined them into one file named ok.sql. But, when I tested
> the combined file it, too, reported many duplicate records. Something
> wrong here.
>
> Perhaps a better approach is to put the CREATE TABLE command above the
> INSERT INTO section of the file (without specifying a PK), load that using
> the command 'psql -d <database> -f waterchem.sql', then add the PK and
> remove duplicates as postgres presents them. Or, perhaps there is a much
> more efficient way to accomplish this task.
>
> Rather than my flailing around and spending a lot of time failing to load
> all non-duplicate rows into the table I'd like to learn the proper way to
> accomplish this task. Suggestions, recommendations, and your experiences
> are requested.
>
> TIA,
>
> Rich

Simplistically you load all the data into a staging table that has no
natural primary key and then write a query that will result in only a
single record for whatever you define as a primary key. Insert the results
of that query into the final table.

If you only use a subset of columns to uniquely determine the PK then you
need to decide how to resolve situations where the PK fields are duplicated
but the extra fields are unique. That step is why the first description is
simplistic.

David J.
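[Editor's note: a minimal sketch of the staging-table approach David
describes. The waterchem schema isn't shown in the thread, so the key
columns (site, sample_date, analyte) and the extra column (value) used here
and in the later sketches are hypothetical stand-ins.]

  -- Staging table with the same columns as the target but no primary key,
  -- so every row from the source file loads without a duplicate-key error.
  CREATE TABLE waterchem_staging (LIKE waterchem);

  -- Load the raw data here, e.g. by pointing the INSERT script
  -- (or a \copy of the text file) at waterchem_staging instead of waterchem.

  -- Move one row per candidate key into the final table. DISTINCT ON keeps
  -- the first row for each key; extending the ORDER BY controls which of
  -- several duplicates survives.
  INSERT INTO waterchem
  SELECT DISTINCT ON (site, sample_date, analyte) *
  FROM waterchem_staging
  ORDER BY site, sample_date, analyte;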
On Fri, 25 Nov 2011, David Johnston wrote:

> Simplistically you load all the data into a staging table that has no
> natural primary key and then write a query that will result in only a
> single record for whatever you define as a primary key. Insert the
> results of that query into the final table.

David,

I was leaning toward doing this.

> If you only use a subset of columns to uniquely determine the PK then you
> need to decide how to resolve situations where the PK fields are
> duplicated but the extra fields are unique. That step is why the first
> description is simplistic.

Well, that's the point of a PK, isn't it? There should be only one row with
that specific set of column values. The additional columns are unique to
that set of PK values.

Thanks,

Rich
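[Editor's note: one way to test whether that optimism holds for a given data
set, i.e. whether any candidate key maps to more than one distinct full row
(same hypothetical column names as in the sketch above):]

  -- Keys whose duplicate rows disagree in the non-key column(s). These need
  -- an explicit resolution rule before deduplicating, because DISTINCT ON
  -- would otherwise silently pick one of the conflicting rows.
  SELECT site, sample_date, analyte
  FROM waterchem_staging
  GROUP BY site, sample_date, analyte
  HAVING count(DISTINCT value) > 1;

[If this returns no rows, every duplicate is an exact copy and any
deduplication rule yields the same result.]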
On 11/25/2011 08:49 AM, Rich Shepard wrote:
> On Fri, 25 Nov 2011, David Johnston wrote:
>
>> Simplistically you load all the data into a staging table that has no
>> natural primary key and then write a query that will result in only a
>> single record for whatever you define as a primary key. Insert the
>> results of that query into the final table.
>
> David,
>
> I was leaning toward doing this.
>
>> If you only use a subset of columns to uniquely determine the PK then you
>> need to decide how to resolve situations where the PK fields are
>> duplicated but the extra fields are unique. That step is why the first
>> description is simplistic.
>
> Well, that's the point of a PK, isn't it? There should be only one row
> with that specific set of column values. The additional columns are unique
> to that set of PK values.

Gotta love an optimist:)

> Thanks,
>
> Rich

--
Adrian Klaver
adrian.klaver@gmail.com
On Fri, Nov 25, 2011 at 11:05 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:
> The data originated in a spreadsheet and, based on my experience, contains
> duplicate records. After reformatting there are 143,260 rows to insert in
> the table. The approach I tried seems to have problems (explained below)
> and I would like to learn the proper way to insert rows in either an empty
> table or one with existing rows since I'll need to do this procedure for
> my projects.
>
> The table was created with the primary key and I used INSERT INTO ... to
> load the data. Many duplicate records, so I split the file into smaller
> ones and re-ran the command to load them. I'd then remove the reported
> duplicate rows from the source (text) file. Between runs, I'd issue the
> DROP TABLE command within psql and check it was empty using
> 'select count(*) from waterchem;'.
>
> It appeared that I removed duplicates from the first couple of smaller
> files so I combined them into one file named ok.sql. But, when I tested
> the combined file it, too, reported many duplicate records. Something
> wrong here.
>
> Perhaps a better approach is to put the CREATE TABLE command above the
> INSERT INTO section of the file (without specifying a PK), load that using
> the command 'psql -d <database> -f waterchem.sql', then add the PK and
> remove duplicates as postgres presents them. Or, perhaps there is a much
> more efficient way to accomplish this task.
>
> Rather than my flailing around and spending a lot of time failing to load
> all non-duplicate rows into the table I'd like to learn the proper way to
> accomplish this task. Suggestions, recommendations, and your experiences
> are requested.
Why don't you first load the data into a table (no primary key), then use SQL to find your dups?
once loaded:
SELECT <primary_key_column>, count(1) from <table> group by 1 having count(1) > 1;
At least then, you'll really know what you're in for. You can either script a DELETE or... whatever you want to do, once clean, you can add the PK.
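[Editor's note: a sketch of one way to script that DELETE, using Postgres's
system column ctid to keep exactly one physical row per key (key columns
hypothetical, as the waterchem schema isn't shown in the thread):]

  -- Remove every duplicate except the copy with the lowest ctid,
  -- an arbitrary but deterministic survivor for each key.
  DELETE FROM waterchem a
  USING waterchem b
  WHERE a.ctid > b.ctid
    AND a.site = b.site
    AND a.sample_date = b.sample_date
    AND a.analyte = b.analyte;

  -- With the duplicates gone, the primary key can be added.
  ALTER TABLE waterchem ADD PRIMARY KEY (site, sample_date, analyte);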
--
Scott
> TIA,
>
> Rich
On Fri, 25 Nov 2011, Scott Mead wrote:

> Why don't you first load the data into a table (no primary key), then use
> SQL to find your dups?
>
> once loaded:
>
> SELECT <primary_key_column>, count(1) from <table> group by 1 having
> count(1) > 1;
>
> At least then, you'll really know what you're in for. You can either
> script a DELETE or... whatever you want to do, once clean, you can add
> the PK.

Scott,

I'll do this. First I need to find why 2 rows (of 143,260) are not in the
table after reading the source file. I have those 2 rows somewhat isolated:
one is in the third quarter of the file and one is in the fourth quarter.

Thanks,

Rich
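[Editor's note: for hunting down the two missing rows, one option is a set
difference against a staging copy of the source data, assuming both tables
have identical column lists as in the earlier sketch:]

  -- Rows present in the raw source data but absent from the final table.
  SELECT * FROM waterchem_staging
  EXCEPT
  SELECT * FROM waterchem;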