Обсуждение: Recommended Protocol: Adding Rows to Table

Поиск
Список
Период
Сортировка

Recommended Protocol: Adding Rows to Table

От
Rich Shepard
Дата:
   The data originated in a spreadsheet and, based on my experience, contains
duplicate records. After reformatting there are 143,260 rows to insert in
the table. The approach I tried seems to have problems (explained below) and
I would like to learn the proper way to insert rows in either an empty table
or one with existing rows since I'll need to do this procedure for my
projects.

   The table was created with the primary key and I used INSERT INTO ... to
load the data. Many duplicate records, so I split the file into smaller ones
and re-ran the command to load them. I'd then remove the reported duplicate
rows from the source (text) file. Between runs, I'd issue the DROP TABLE
command within psql and check it was empty using 'select count(*) from
waterchem;'.

   It appeared that I removed duplicates from the first couple of smaller
files so I combined them into one file named ok.sql. But, when I tested the
combined file it, too, reported many duplicate records. Something wrong
here.

   Perhaps a better approach is to put the CREATE TABLE command above the
INSERT INTO section of the file (without specifying a PK), load that using
the command 'psql -d <database> -f waterchem.sql', then add the PK and
remove duplicates as postgres presents them. Or, perhaps there is a much
more efficient way to accomplish this task.

   Rather than my flailing around and spending a lot of time failing to load
all non-duplicate rows into the table I'd like to learn the proper way to
accomplish this task. Suggestions, recommendations, and your experiences are
requested.

TIA,

Rich

Re: Recommended Protocol: Adding Rows to Table

От
David Johnston
Дата:
On Nov 25, 2011, at 11:05, Rich Shepard <rshepard@appl-ecosys.com> wrote:

>  The data originated in a spreadsheet and, based on my experience, contains
> duplicate records. After reformatting there are 143,260 rows to insert in
> the table. The approach I tried seems to have problems (explained below) and
> I would like to learn the proper way to insert rows in either an empty table
> or one with existing rows since I'll need to do this procedure for my
> projects.
>
>  The table was created with the primary key and I used INSERT INTO ... to
> load the data. Many duplicate records, so I split the file into smaller ones
> and re-ran the command to load them. I'd then remove the reported duplicate
> rows from the source (text) file. Between runs, I'd issue the DROP TABLE
> command within psql and check it was empty using 'select count(*) from
> waterchem;'.
>
>  It appeared that I removed duplicates from the first couple of smaller
> files so I combined them into one file named ok.sql. But, when I tested the
> combined file it, too, reported many duplicate records. Something wrong
> here.
>
>  Perhaps a better approach is to put the CREATE TABLE command above the
> INSERT INTO section of the file (without specifying a PK), load that using
> the command 'psql -d <database> -f waterchem.sql', then add the PK and
> remove duplicates as postgres presents them. Or, perhaps there is a much
> more efficient way to accomplish this task.
>
>  Rather than my flailing around and spending a lot of time failing to load
> all non-duplicate rows into the table I'd like to learn the proper way to
> accomplish this task. Suggestions, recommendations, and your experiences are
> requested.
>
> TIA,
>
> Rich
>
>

Simplistically you load all the data into a staging table that has no natural primary key and then write a query that
willresult in only a single record for whatever you define as a primary key.  Insert the results of that query into the
finaltable. 

If you only use a subset of columns to uniquely determine the PK than you need to decide how to resolve situations
wherethe PK fields are duplicated but the extra fields are unique.  That step is why the first description is
simplistic.

David J.

Re: Recommended Protocol: Adding Rows to Table

От
Rich Shepard
Дата:
On Fri, 25 Nov 2011, David Johnston wrote:

> Simplistically you load all the data into a staging table that has no
> natural primary key and then write a query that will result in only a
> single record for whatever you define as a primary key.  Insert the
> results of that query into the final table.

David,

   I was leaning toward doing this.

> If you only use a subset of columns to uniquely determine the PK than you
> need to decide how to resolve situations where the PK fields are
> duplicated but the extra fields are unique. That step is why the first
> description is simplistic.

   Well, that's point of a PK, isn't it? There should be only one row with
that specific set of column values. The additional columns are unique to
that set of PK values.

Thanks,

Rich

Re: Recommended Protocol: Adding Rows to Table

От
Adrian Klaver
Дата:
On 11/25/2011 08:49 AM, Rich Shepard wrote:
> On Fri, 25 Nov 2011, David Johnston wrote:
>
>> Simplistically you load all the data into a staging table that has no
>> natural primary key and then write a query that will result in only a
>> single record for whatever you define as a primary key. Insert the
>> results of that query into the final table.
>
> David,
>
> I was leaning toward doing this.
>
>> If you only use a subset of columns to uniquely determine the PK than you
>> need to decide how to resolve situations where the PK fields are
>> duplicated but the extra fields are unique. That step is why the first
>> description is simplistic.
>
> Well, that's point of a PK, isn't it? There should be only one row with
> that specific set of column values. The additional columns are unique to
> that set of PK values.

Gotta love an optimist:)

>
> Thanks,
>
> Rich
>


--
Adrian Klaver
adrian.klaver@gmail.com

Re: Recommended Protocol: Adding Rows to Table

От
Scott Mead
Дата:

On Fri, Nov 25, 2011 at 11:05 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:
 The data originated in a spreadsheet and, based on my experience, contains
duplicate records. After reformatting there are 143,260 rows to insert in
the table. The approach I tried seems to have problems (explained below) and
I would like to learn the proper way to insert rows in either an empty table
or one with existing rows since I'll need to do this procedure for my
projects.

 The table was created with the primary key and I used INSERT INTO ... to
load the data. Many duplicate records, so I split the file into smaller ones
and re-ran the command to load them. I'd then remove the reported duplicate
rows from the source (text) file. Between runs, I'd issue the DROP TABLE
command within psql and check it was empty using 'select count(*) from
waterchem;'.

 It appeared that I removed duplicates from the first couple of smaller
files so I combined them into one file named ok.sql. But, when I tested the
combined file it, too, reported many duplicate records. Something wrong
here.

 Perhaps a better approach is to put the CREATE TABLE command above the
INSERT INTO section of the file (without specifying a PK), load that using
the command 'psql -d <database> -f waterchem.sql', then add the PK and
remove duplicates as postgres presents them. Or, perhaps there is a much
more efficient way to accomplish this task.

 Rather than my flailing around and spending a lot of time failing to load
all non-duplicate rows into the table I'd like to learn the proper way to
accomplish this task. Suggestions, recommendations, and your experiences are
requested.

Why don't you first load the data into a table (no primary key), then use SQL to find your dups? 

once loaded:
  SELECT <primary_key_column>, count(1) from <table> group by 1 having count(1) > 1;

 At least then, you'll really know what you're in for.  You can either script a DELETE or... whatever you want to do, once clean, you can add the PK.

--
Scott

TIA,

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Recommended Protocol: Adding Rows to Table

От
Rich Shepard
Дата:
On Fri, 25 Nov 2011, Scott Mead wrote:

> Why don't you first load the data into a table (no primary key), then use
> SQL to find your dups?
>
> once loaded:
>  SELECT <primary_key_column>, count(1) from <table> group by 1 having
> count(1) > 1;
>
> At least then, you'll really know what you're in for.  You can either
> script a DELETE or... whatever you want to do, once clean, you can add the
> PK.

Scott,

   I'll do this. First I need to find why 2 rows (of 143260) are not in the
table after reading the source file. I have those 2 rows somewhat isolated
as one's in the third quarter of the file and one in the fourth quarter.

Thanks,

Rich