Thread: data import - duplicates

data import - duplicates

From: brian
Date:
I have ~350K rows of sample data that has come to me in 64 text files
(CSV) that I'd like to import into a new database. Each file can be
considered its own category and is so named. That is, each file name
will be inserted into a separate categories table. I'd like to relate
each row to its category. However, while the rows are unique in each
file, some rows are found in more than one file.

I also must parse some of the fields into separate columns, but that's
no big deal; it just means I must do some pre-processing on these
files in any case.

After some thought, I figured I might brute-force the problem with Perl
by reading each line of each file into an assoc. array unless it's
already in it, in which case I'd append to the "key" based on the list of
categories that line is found in (I'd still need to parse out the keys
later but I'd get rid of the dupes). Each array row would be like so:

'key' => '1,4,etc.', 'text' => 'a line'

Aside from the fact that the array (and searching it) would become
ridiculously large after just a couple of files, I realised that this is a
non-starter for the simple reason that this data comprises less than 25%
of the eventual total, so de-duping it this way now would be a waste of
time (there will probably be dupes in the files still to come).

So, I'd like to instead parse out my columns properly and write each
line (along with its category key) to a new, single file to be copied
into a working table. Once I've done so, is there an efficient method I
can use to select all duplicates (save for the category key) into a set
from which I could then select into the final table (and insert the keys
into the category join table)?

For example (pk is the PK from the working table and ck is the category
key), my dupes query on the working table would give the following set:

pk  ck  content
1   1  a  a  a  a
2   3  a  a  a  a
3   3  b  b  b  b
4   7  b  b  b  b
5   4  a  a  a  a

I would then want to insert just the unique rows into the final table
yet add all of the PKs and CKs to the category join table. After
that was done, I'd delete all of these from the working table and then
move the unique rows that are left to the final table (and insert the
keys into the join table).
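
Roughly, the dupes query I'm picturing would be something along these
lines (names made up: "working" is the working table and "content" stands
in for the parsed columns):

SELECT pk, ck, content
FROM working
WHERE content IN (
    -- every piece of content that shows up under more than one row
    SELECT content
    FROM working
    GROUP BY content
    HAVING count(*) > 1
)
ORDER BY content;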

I hope that makes sense. I'm not looking for anyone to do my homework
for me; I'm sure I could fix up a tasty function for this (the data is
destined for MySQL, alas, but I'll be damned if I won't use PG for the
heavy lifting). What I'm really looking for is a handy way to grab all
of those dupes.

Re: data import - duplicates

From: "Webb Sprague"
Date:
I haven't tested but this is what I would do (uses arrays, which are
handy when you need them), with the names changed to protect the
innocent:

begin;
-- create a table with some duplicates in one of the columns (y plays the role of ck)
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(x);
SELECT

wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table with the ck (y), an array of the pks (x) that share it, and
-- the length of that array
wsprague=# select y, x_list, array_upper(x_list, 1) as x_len into baz from
wsprague-# (select y, array_accum(x) as x_list from fbar group by y) a;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)

-- delete every row whose pk (x) is not the first element of its group's x_list
wsprague=# delete from fbar where not exists (select 1 from baz where
wsprague-# fbar.x=baz.x_list[1]);
DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)

commit;
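
By the way, array_accum isn't a built-in aggregate; it's the example from
the PostgreSQL docs, so if you don't already have it around you'll want to
create it first:

create aggregate array_accum (anyelement)
(
    sfunc = array_append,    -- append each value to the state array
    stype = anyarray,
    initcond = '{}'
);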

I hope that makes sense.  It should be easy to make into a function
(something like clean_table(table_name text, pk_name text, dup_key_name text);
rough sketch below).  I don't know how well it will work with indexes.  You
could probably even write a function to do the entire import inside
postgres, starting with a COPY, moving on to merging tables, and ending with
some consistency checks, and thus benefit from transaction isolation of
the whole process....
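
Completely untested, but clean_table() could look roughly along these lines
(plain dynamic SQL; it assumes the array_accum aggregate above):

create or replace function clean_table(table_name text, pk_name text,
                                        dup_key_name text)
returns void as $$
begin
    -- keep only the row whose pk comes first in each group of duplicates,
    -- grouping on the duplicated column
    execute 'delete from ' || quote_ident(table_name) || ' t'
         || ' where not exists ('
         || '   select 1 from ('
         || '     select array_accum(' || quote_ident(pk_name) || ') as pk_list'
         || '     from ' || quote_ident(table_name)
         || '     group by ' || quote_ident(dup_key_name)
         || '   ) d'
         || '   where t.' || quote_ident(pk_name) || ' = d.pk_list[1]'
         || ' )';
end;
$$ language plpgsql;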

HTH


Re: data import - duplicates

From: brian
Date:

Hey, neat idea. I changed things a little:

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
  4 |  4 | string1
  5 |  5 | string1
  6 |  1 | string3
  7 |  2 | string1
(7 rows)

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_accum(ck) AS key_list, content
test-# FROM table1 GROUP BY content
test-# ) a;
SELECT

test=# SELECT * FROM table2;
 key_list  | key_num | content
-----------+---------+---------
 {2,1}     |       2 | string3
 {1,4,5,2} |       4 | string1
 {2}       |       1 | string2
(3 rows)


I had to modify your DELETE suggestion a bit because what I accumulate into
key_list is the ck (your x), and a ck isn't unique across the table (although
it must be within a given key_list), so I have to match on content as well.
It'll be a performance hog, though, because it's comparing each string.


test=# DELETE FROM table1 WHERE NOT EXISTS (
test-# SELECT 1 FROM table2
test-# WHERE table1.content = table2.content
test-# AND table1.ck = table2.key_list[1]
test-# );
DELETE 4

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
(3 rows)
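
If that string comparison gets too slow on the full data set, I figure plain
indexes on the content column in both tables (or, if the strings are long,
expression indexes on md5(content)) should take some of the sting out of it,
something like:

-- hypothetical index names; plain btree indexes on the join column
CREATE INDEX table1_content_idx ON table1 (content);
CREATE INDEX table2_content_idx ON table2 (content);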

From here I can drop the table1.ck column. But, in order to place the
table2.key_list ints into a join table, I need the PK from table1. I
have the content column with which to fetch that from table1, but that
hardly seems an efficient method. So, I modified your idea for gathering
the dupe CKs a little (after dropping the first table2):

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_cat(ARRAY[array_accum(id)], ARRAY[array_accum(ck)])
test-# AS key_list, content
test-# FROM table1 GROUP BY content) a;
SELECT

test=# select * from table2;
       key_list        | key_num | content
-----------------------+---------+---------
 {{3,6},{2,1}}         |       2 | string3
 {{1,4,5,7},{1,4,5,2}} |       2 | string1
 {{2},{2}}             |       2 | string2
(3 rows)

Huzzah! IOW, I accumulate both the PKs and CKs from table1 for each
string. Then, I can ditch all but the first PK, which should make
transferring these values into the join table a snap.
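
Something like this should do for the join table (untested; assuming the
join table is called category_join with columns item_id and ck):

-- one row per (first PK, ck) pair: the first row of key_list holds the ids,
-- the second row holds the cks
INSERT INTO category_join (item_id, ck)
SELECT key_list[1][1],
       key_list[2][generate_series(1, array_upper(key_list, 2))]
FROM table2;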

# lightbulb comes on ...

Actually, what I'll do is remove the extraneous PKs and use them to
DELETE FROM table1. Much more efficient, that.
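That is, something like (again untested):

DELETE FROM table1
WHERE id IN (
    -- the extraneous PKs: everything after the first element
    -- in the first row of key_list
    SELECT key_list[1][generate_series(2, array_upper(key_list, 2))]
    FROM table2
);

That way the string comparison is avoided entirely.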

Thanks for your ideas. And if anyone else has any suggestions, please feel
free to chime in. When I get the rest of the data I'll post back here on
how it went in case google points someone here in the future.

b